action #106912
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors
Fullstack test can still fail due to `shutdown` module size:M
Description
Observation¶
This is is a shortened version of the relevant log:
# [2022-02-15T12:30:06.683729Z] [debug] ||| starting boot tests/boot.pm
# [2022-02-15T12:30:06.685985Z] [debug] tests/boot.pm:9 called testapi::assert_screen
# [2022-02-15T12:30:06.686361Z] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
# [2022-02-15T12:30:06.700419Z] [debug] no match: 14.9s, best candidate: boot-core (0.00)
…
# [2022-02-15T12:30:07.125518Z] [debug] no change: 14.5s
# [2022-02-15T12:30:07.134823Z] [debug] no match: 14.5s, best candidate: boot-core (0.00)
# [2022-02-15T12:30:07.331434Z] [debug] >>> testapi::_handle_found_needle: found boot-core, similarity 0.97 @ 1/74
# [2022-02-15T12:30:07.332261Z] [debug] tests/boot.pm:10 called testapi::send_key
# [2022-02-15T12:30:07.332397Z] [debug] <<< testapi::send_key(key="ret", wait_screen_change=0, do_wait=0)
# [2022-02-15T12:30:07.602261Z] [debug] tests/boot.pm:14 called testapi::assert_screen
# [2022-02-15T12:30:07.602421Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=90)
# [2022-02-15T12:30:08.241080Z] [debug] no match: 89.3s, best candidate: boot-on_prompt (0.00)
…
# [2022-02-15T12:30:21.249634Z] [debug] no match: 76.3s, best candidate: boot-on_prompt (0.00)
# [2022-02-15T12:30:22.370453Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:22.374020Z] [debug] ||| finished boot tests (runtime: 16 s)
# [2022-02-15T12:30:22.375138Z] [debug] ||| starting assert_screen tests/assert_screen.pm
# [2022-02-15T12:30:22.375777Z] [debug] tests/assert_screen.pm:9 called testapi::assert_screen
# [2022-02-15T12:30:22.375889Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=60)
# [2022-02-15T12:30:23.323411Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:23.323585Z] [debug] tests/assert_screen.pm:10 called testapi::assert_screen
# [2022-02-15T12:30:23.323700Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=30, no_wait=1)
# [2022-02-15T12:30:23.397251Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:23.398459Z] [debug] ||| finished assert_screen tests (runtime: 1 s)
# [2022-02-15T12:30:23.399556Z] [debug] ||| starting shutdown tests/shutdown.pm
# [2022-02-15T12:30:23.400299Z] [debug] tests/shutdown.pm:8 called testapi::type_string
# [2022-02-15T12:30:23.400468Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-02-15T12:30:23.671645Z] [debug] tests/shutdown.pm:9 called testapi::type_string
# [2022-02-15T12:30:23.671846Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-02-15T12:30:23.976613Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
# [2022-02-15T12:30:23.976768Z] [debug] <<< testapi::check_shutdown(timeout=90)
# [2022-02-15T12:30:23.977601Z] [debug] QEMU status is not 'shutdown', it is 'running'
# [2022-02-15T12:30:24.978574Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":798568,"seconds":1644928224}}
# [2022-02-15T12:30:24.978769Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
# [2022-02-15T12:31:54.110983Z] [debug] QEMU status is not 'shutdown', it is 'running'
# [2022-02-15T12:31:55.174986Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.
#
# [2022-02-15T12:31:55.177594Z] [debug] ||| finished shutdown tests (runtime: 92 s)
# [2022-02-15T12:31:55.177662Z] [debug] ||| post fail hooks runtime: 0 s
The test run was conducted as "job 7" as part of openQA's fullstack test (with caching enabled). I suppose it should also be reproducible in other "fullstack scenarios" where we expect the test to pass. Maybe os-autoinst's own fullstack test is not affected because it runs a few more test modules before shutdown
. The change https://github.com/os-autoinst/os-autoinst/commit/a7103b16 was supposed to fix that and it definitely improved the situation. However, it seems in very rare cases¹ the test can still fail. (The fix was definitely used here as one can see in the logs that the assert_screen
module is successfully executed.)
¹ I ran openQA's fullstack test about 75 times in CircleCI until I hit an occurrence.
Acceptance criteria¶
- AC1: openQA fullstack test runs 100 times without failing
Suggestions¶
- Not Ondrej's fault
- Make it fail on purpose... but we don't know the exact cause, only the symptom
- Reveal the video from the pool directory after reproducing locally
Out of scope¶
- Reveal the video from the pool directory in CI
Files
Updated by okurz almost 3 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan almost 3 years ago
- Subject changed from Fullstack test can still fail due to `shutdown` module to Fullstack test can still fail due to `shutdown` module size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan almost 3 years ago
- Blocked by action #107002: Expose fullstack test video from pool directory in CI size:M added
Updated by mkittler almost 3 years ago
- File fullstack-shutdown-problem-video.ogv fullstack-shutdown-problem-video.ogv added
- File mpv-shot0011.jpg mpv-shot0011.jpg added
- File mpv-shot0013.jpg mpv-shot0013.jpg added
- File mpv-shot0015.jpg mpv-shot0015.jpg added
- File mpv-shot0019.jpg mpv-shot0019.jpg added
- File mpv-shot0020.jpg mpv-shot0020.jpg added
- File mpv-shot0021.jpg mpv-shot0021.jpg added
- File mpv-shot0022.jpg mpv-shot0022.jpg added
- File mpv-shot0023.jpg mpv-shot0023.jpg added
- File mpv-shot0024.jpg mpv-shot0024.jpg added
Thanks to @cdywan's work on #107002 we now have a video. It shows that poweroff
is typed before the prompt "within" sudo
is ready and one can also shortly see some boot messages after reaching the prompt. The typing poweroff
actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.
I've been attaching the most important frames of the video (they're ordered despite some numbers being omitted).
So we see:
- Boot logging
- Prompt with keys for sudo and poweroff already being typed
- Prompt is gone again, further boot messages appear (and not it doesn't look like shutdown messages)
- Prompt is shown again but it is messed up by further boot messages
- Screen is cleared, just a single "_" appears
- We're back at a clean prompt
- The keystrokes for sudo and poweroff are "re-typed" or "replied"
- That's it, apparently the poweroff command is there but never submitted or has no effect
I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all. I'm also wondering why 7. happens.
Updated by okurz almost 3 years ago
- Status changed from Blocked to Feedback
- Assignee set to okurz
@cdywan please only "Blocked" with assignee so that we know who tracks the blocker. I set you as assignee now. (EDIT: took assignee myself now because I think it's not blocked anymore)
mkittler wrote:
Thanks to @cdywan's work on #107002 we now have a video. It shows that
poweroff
is typed before the prompt "within"sudo
is ready and one can also shortly see some boot messages after reaching the prompt. The typingpoweroff
actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.
[…]
I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all.
Couple of different ideas to try
- Use "power" function from test API, see https://github.com/os-autoinst/os-autoinst/blob/master/testapi.pm#L1902
- In https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/79/files#diff-1e64619f043972b0dff7b9d7ecde066f55e04b2fcb307beb4ead0e03dc69b6a6R8 we do a different approach by using a "root" console, not switching just in before with "su"
- Instead of ending up with typing "sudo su" and then typing (too soon) "poweroff" we should just type "sudo poweroff"
Trying that now myself: https://github.com/os-autoinst/os-autoinst/pull/1963
I'm also wondering why 7. happens.
I think this is normal behaviour. If you type while the next prompt isn't ready characters are echoed onto the (raw) console. But as soon as the prompt appears (and your PS command and stuff is finished processing) then the previously typed text is repeated from the buffer into the prompt line. I would not worry about that.
Updated by okurz almost 3 years ago
I triggered the openQA full stack test manually both in os-autoinst https://github.com/os-autoinst/os-autoinst/runs/5323123216?check_suite_focus=true as well as openQA https://github.com/os-autoinst/openQA/runs/5323139061?check_suite_focus=true with the "fail if any retry fails". Locally I am running runs=200 count_fail_ratio prove -I. t/99-full-stack.t
Both CI runs failed due to updated dependency packages. So I triggered another dependency PR run from within https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&filter=all selecting a "nightly" workflow job. This triggered https://app.circleci.com/pipelines/github/os-autoinst/openQA/9058/workflows/2e92004c-e279-4c95-a130-6f857b6b3ee5/jobs/85580 which created https://github.com/os-autoinst/openQA/pull/4531
Updated by okurz almost 3 years ago
- Blocked by deleted (action #107002: Expose fullstack test video from pool directory in CI size:M)
Updated by okurz almost 3 years ago
- Related to action #107002: Expose fullstack test video from pool directory in CI size:M added
Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with
Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC: could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t ..
No subtests run
[20:18:19]
maybe saving all artifacts won't scale that good :) But the original problem did not appear here.
The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 90 runs successfully without any error so I declare this good enough :)https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with
Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC: could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t ..
No subtests run
[20:18:19]
maybe saving all artifacts won't scale that good :) But the original problem did not appear here.
The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 130 runs successfully without any error so I declare this good enough :)