action #106912: Fullstack test can still fail due to `shutdown` module size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

action #106912

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors

Fullstack test can still fail due to `shutdown` module size:M

Added by mkittler about 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-02-16

Due date:

% Done:

Estimated time:

Description

Observation¶

This is is a shortened version of the relevant log:

    # [2022-02-15T12:30:06.683729Z] [debug] ||| starting boot tests/boot.pm
    # [2022-02-15T12:30:06.685985Z] [debug] tests/boot.pm:9 called testapi::assert_screen
    # [2022-02-15T12:30:06.686361Z] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
    # [2022-02-15T12:30:06.700419Z] [debug] no match: 14.9s, best candidate: boot-core (0.00)
…
    # [2022-02-15T12:30:07.125518Z] [debug] no change: 14.5s
    # [2022-02-15T12:30:07.134823Z] [debug] no match: 14.5s, best candidate: boot-core (0.00)
    # [2022-02-15T12:30:07.331434Z] [debug] >>> testapi::_handle_found_needle: found boot-core, similarity 0.97 @ 1/74
    # [2022-02-15T12:30:07.332261Z] [debug] tests/boot.pm:10 called testapi::send_key
    # [2022-02-15T12:30:07.332397Z] [debug] <<< testapi::send_key(key="ret", wait_screen_change=0, do_wait=0)
    # [2022-02-15T12:30:07.602261Z] [debug] tests/boot.pm:14 called testapi::assert_screen
    # [2022-02-15T12:30:07.602421Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=90)
    # [2022-02-15T12:30:08.241080Z] [debug] no match: 89.3s, best candidate: boot-on_prompt (0.00)
…
    # [2022-02-15T12:30:21.249634Z] [debug] no match: 76.3s, best candidate: boot-on_prompt (0.00)
    # [2022-02-15T12:30:22.370453Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:22.374020Z] [debug] ||| finished boot tests (runtime: 16 s)
    # [2022-02-15T12:30:22.375138Z] [debug] ||| starting assert_screen tests/assert_screen.pm
    # [2022-02-15T12:30:22.375777Z] [debug] tests/assert_screen.pm:9 called testapi::assert_screen
    # [2022-02-15T12:30:22.375889Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=60)
    # [2022-02-15T12:30:23.323411Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:23.323585Z] [debug] tests/assert_screen.pm:10 called testapi::assert_screen
    # [2022-02-15T12:30:23.323700Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=30, no_wait=1)
    # [2022-02-15T12:30:23.397251Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:23.398459Z] [debug] ||| finished assert_screen tests (runtime: 1 s)
    # [2022-02-15T12:30:23.399556Z] [debug] ||| starting shutdown tests/shutdown.pm
    # [2022-02-15T12:30:23.400299Z] [debug] tests/shutdown.pm:8 called testapi::type_string
    # [2022-02-15T12:30:23.400468Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
    # [2022-02-15T12:30:23.671645Z] [debug] tests/shutdown.pm:9 called testapi::type_string
    # [2022-02-15T12:30:23.671846Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
    # [2022-02-15T12:30:23.976613Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
    # [2022-02-15T12:30:23.976768Z] [debug] <<< testapi::check_shutdown(timeout=90)
    # [2022-02-15T12:30:23.977601Z] [debug] QEMU status is not 'shutdown', it is 'running'
    # [2022-02-15T12:30:24.978574Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":798568,"seconds":1644928224}}
    # [2022-02-15T12:30:24.978769Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
    # [2022-02-15T12:31:54.110983Z] [debug] QEMU status is not 'shutdown', it is 'running'
    # [2022-02-15T12:31:55.174986Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.
    #   
    # [2022-02-15T12:31:55.177594Z] [debug] ||| finished shutdown tests (runtime: 92 s)
    # [2022-02-15T12:31:55.177662Z] [debug] ||| post fail hooks runtime: 0 s

The test run was conducted as "job 7" as part of openQA's fullstack test (with caching enabled). I suppose it should also be reproducible in other "fullstack scenarios" where we expect the test to pass. Maybe os-autoinst's own fullstack test is not affected because it runs a few more test modules before shutdown. The change https://github.com/os-autoinst/os-autoinst/commit/a7103b16 was supposed to fix that and it definitely improved the situation. However, it seems in very rare cases¹ the test can still fail. (The fix was definitely used here as one can see in the logs that the assert_screen module is successfully executed.)

¹ I ran openQA's fullstack test about 75 times in CircleCI until I hit an occurrence.

Acceptance criteria¶

AC1: openQA fullstack test runs 100 times without failing

Suggestions¶

Not Ondrej's fault
Make it fail on purpose... but we don't know the exact cause, only the symptom
Reveal the video from the pool directory after reproducing locally

Out of scope¶

Reveal the video from the pool directory in CI

Files

Download all files

fullstack-shutdown-again.txt (42.5 KB) fullstack-shutdown-again.txt		mkittler, 2022-02-16 11:59
full-stack-shutdown-again-detailed-log.txt (222 KB) full-stack-shutdown-again-detailed-log.txt		mkittler, 2022-02-16 11:59
fullstack-shutdown-problem-video.ogv (128 KB) fullstack-shutdown-problem-video.ogv		mkittler, 2022-02-22 11:58
mpv-shot0011.jpg (41.6 KB) mpv-shot0011.jpg		mkittler, 2022-02-22 12:11
mpv-shot0013.jpg (53.7 KB) mpv-shot0013.jpg		mkittler, 2022-02-22 12:11
mpv-shot0015.jpg (35.1 KB) mpv-shot0015.jpg		mkittler, 2022-02-22 12:11
mpv-shot0019.jpg (72.4 KB) mpv-shot0019.jpg		mkittler, 2022-02-22 12:11
mpv-shot0020.jpg (44.5 KB) mpv-shot0020.jpg		mkittler, 2022-02-22 12:11
mpv-shot0021.jpg (21.8 KB) mpv-shot0021.jpg		mkittler, 2022-02-22 12:11
mpv-shot0022.jpg (31 KB) mpv-shot0022.jpg		mkittler, 2022-02-22 12:11
mpv-shot0023.jpg (31.6 KB) mpv-shot0023.jpg		mkittler, 2022-02-22 12:11
mpv-shot0024.jpg (35.1 KB) mpv-shot0024.jpg		mkittler, 2022-02-22 12:11

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by okurz about 3 years ago

Priority changed from Normal to High
Target version set to Ready

Actions

Copy link

Updated by okurz about 3 years ago

Tracker changed from coordination to action

Actions

Copy link

Updated by livdywan about 3 years ago

Subject changed from Fullstack test can still fail due to `shutdown` module to Fullstack test can still fail due to `shutdown` module size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by livdywan about 3 years ago

Blocked by action #107002: Expose fullstack test video from pool directory in CI size:M added

Actions

Copy link

Updated by livdywan about 3 years ago

Status changed from Workable to Blocked

Actions

Copy link Download all files

Updated by mkittler about 3 years ago

File fullstack-shutdown-problem-video.ogv fullstack-shutdown-problem-video.ogv added
File mpv-shot0011.jpg mpv-shot0011.jpg added
File mpv-shot0013.jpg mpv-shot0013.jpg added
File mpv-shot0015.jpg mpv-shot0015.jpg added
File mpv-shot0019.jpg mpv-shot0019.jpg added
File mpv-shot0020.jpg mpv-shot0020.jpg added
File mpv-shot0021.jpg mpv-shot0021.jpg added
File mpv-shot0022.jpg mpv-shot0022.jpg added
File mpv-shot0023.jpg mpv-shot0023.jpg added
File mpv-shot0024.jpg mpv-shot0024.jpg added

Thanks to @cdywan's work on #107002 we now have a video. It shows that poweroff is typed before the prompt "within" sudo is ready and one can also shortly see some boot messages after reaching the prompt. The typing poweroff actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.

I've been attaching the most important frames of the video (they're ordered despite some numbers being omitted).

So we see:

Boot logging
Prompt with keys for sudo and poweroff already being typed
Prompt is gone again, further boot messages appear (and not it doesn't look like shutdown messages)
Prompt is shown again but it is messed up by further boot messages
Screen is cleared, just a single "_" appears
We're back at a clean prompt
The keystrokes for sudo and poweroff are "re-typed" or "replied"
That's it, apparently the poweroff command is there but never submitted or has no effect

I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all. I'm also wondering why 7. happens.

Actions

Copy link

Updated by okurz about 3 years ago

Status changed from Blocked to Feedback
Assignee set to okurz

@cdywan please only "Blocked" with assignee so that we know who tracks the blocker. I set you as assignee now. (EDIT: took assignee myself now because I think it's not blocked anymore)

mkittler wrote:

Thanks to @cdywan's work on #107002 we now have a video. It shows that poweroff is typed before the prompt "within" sudo is ready and one can also shortly see some boot messages after reaching the prompt. The typing poweroff actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.
[…]
I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all.

Couple of different ideas to try

Use "power" function from test API, see https://github.com/os-autoinst/os-autoinst/blob/master/testapi.pm#L1902
In https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/79/files#diff-1e64619f043972b0dff7b9d7ecde066f55e04b2fcb307beb4ead0e03dc69b6a6R8 we do a different approach by using a "root" console, not switching just in before with "su"
Instead of ending up with typing "sudo su" and then typing (too soon) "poweroff" we should just type "sudo poweroff"

Trying that now myself: https://github.com/os-autoinst/os-autoinst/pull/1963

I'm also wondering why 7. happens.

I think this is normal behaviour. If you type while the next prompt isn't ready characters are echoed onto the (raw) console. But as soon as the prompt appears (and your PS command and stuff is finished processing) then the previously typed text is repeated from the buffer into the prompt line. I would not worry about that.

Actions

Copy link

Updated by okurz about 3 years ago

I triggered the openQA full stack test manually both in os-autoinst https://github.com/os-autoinst/os-autoinst/runs/5323123216?check_suite_focus=true as well as openQA https://github.com/os-autoinst/openQA/runs/5323139061?check_suite_focus=true with the "fail if any retry fails". Locally I am running runs=200 count_fail_ratio prove -I. t/99-full-stack.t

Both CI runs failed due to updated dependency packages. So I triggered another dependency PR run from within https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&filter=all selecting a "nightly" workflow job. This triggered https://app.circleci.com/pipelines/github/os-autoinst/openQA/9058/workflows/2e92004c-e279-4c95-a130-6f857b6b3ee5/jobs/85580 which created https://github.com/os-autoinst/openQA/pull/4531

Actions

Copy link

Updated by okurz about 3 years ago

Blocked by deleted (action #107002: Expose fullstack test video from pool directory in CI size:M)

Actions

Copy link

#10

Updated by okurz about 3 years ago

Related to action #107002: Expose fullstack test video from pool directory in CI size:M added

Actions

Copy link

#11

Updated by okurz about 3 years ago

Status changed from Feedback to Resolved

https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with

Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC:  could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t .. 
No subtests run 
[20:18:19]

maybe saving all artifacts won't scale that good :) But the original problem did not appear here.

The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 90 runs successfully without any error so I declare this good enough :)https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with

Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC:  could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t .. 
No subtests run 
[20:18:19]

maybe saving all artifacts won't scale that good :) But the original problem did not appear here.

The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 130 runs successfully without any error so I declare this good enough :)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #106912

Fullstack test can still fail due to `shutdown` module size:M

Observation¶

Acceptance criteria¶

Suggestions¶

Out of scope¶

Updated by okurz about 3 years ago

Updated by okurz about 3 years ago

Updated by livdywan about 3 years ago

Updated by livdywan about 3 years ago

Updated by livdywan about 3 years ago

Updated by mkittler about 3 years ago

Updated by okurz about 3 years ago

Updated by okurz about 3 years ago

Updated by okurz about 3 years ago

Updated by okurz about 3 years ago

Updated by okurz about 3 years ago