action #106912
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors
Fullstack test can still fail due to `shutdown` module size:M
Description
Observation¶
This is is a shortened version of the relevant log:
# [2022-02-15T12:30:06.683729Z] [debug] ||| starting boot tests/boot.pm
# [2022-02-15T12:30:06.685985Z] [debug] tests/boot.pm:9 called testapi::assert_screen
# [2022-02-15T12:30:06.686361Z] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
# [2022-02-15T12:30:06.700419Z] [debug] no match: 14.9s, best candidate: boot-core (0.00)
…
# [2022-02-15T12:30:07.125518Z] [debug] no change: 14.5s
# [2022-02-15T12:30:07.134823Z] [debug] no match: 14.5s, best candidate: boot-core (0.00)
# [2022-02-15T12:30:07.331434Z] [debug] >>> testapi::_handle_found_needle: found boot-core, similarity 0.97 @ 1/74
# [2022-02-15T12:30:07.332261Z] [debug] tests/boot.pm:10 called testapi::send_key
# [2022-02-15T12:30:07.332397Z] [debug] <<< testapi::send_key(key="ret", wait_screen_change=0, do_wait=0)
# [2022-02-15T12:30:07.602261Z] [debug] tests/boot.pm:14 called testapi::assert_screen
# [2022-02-15T12:30:07.602421Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=90)
# [2022-02-15T12:30:08.241080Z] [debug] no match: 89.3s, best candidate: boot-on_prompt (0.00)
…
# [2022-02-15T12:30:21.249634Z] [debug] no match: 76.3s, best candidate: boot-on_prompt (0.00)
# [2022-02-15T12:30:22.370453Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:22.374020Z] [debug] ||| finished boot tests (runtime: 16 s)
# [2022-02-15T12:30:22.375138Z] [debug] ||| starting assert_screen tests/assert_screen.pm
# [2022-02-15T12:30:22.375777Z] [debug] tests/assert_screen.pm:9 called testapi::assert_screen
# [2022-02-15T12:30:22.375889Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=60)
# [2022-02-15T12:30:23.323411Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:23.323585Z] [debug] tests/assert_screen.pm:10 called testapi::assert_screen
# [2022-02-15T12:30:23.323700Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=30, no_wait=1)
# [2022-02-15T12:30:23.397251Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
# [2022-02-15T12:30:23.398459Z] [debug] ||| finished assert_screen tests (runtime: 1 s)
# [2022-02-15T12:30:23.399556Z] [debug] ||| starting shutdown tests/shutdown.pm
# [2022-02-15T12:30:23.400299Z] [debug] tests/shutdown.pm:8 called testapi::type_string
# [2022-02-15T12:30:23.400468Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-02-15T12:30:23.671645Z] [debug] tests/shutdown.pm:9 called testapi::type_string
# [2022-02-15T12:30:23.671846Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-02-15T12:30:23.976613Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
# [2022-02-15T12:30:23.976768Z] [debug] <<< testapi::check_shutdown(timeout=90)
# [2022-02-15T12:30:23.977601Z] [debug] QEMU status is not 'shutdown', it is 'running'
# [2022-02-15T12:30:24.978574Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":798568,"seconds":1644928224}}
# [2022-02-15T12:30:24.978769Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
# [2022-02-15T12:31:54.110983Z] [debug] QEMU status is not 'shutdown', it is 'running'
# [2022-02-15T12:31:55.174986Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.
#
# [2022-02-15T12:31:55.177594Z] [debug] ||| finished shutdown tests (runtime: 92 s)
# [2022-02-15T12:31:55.177662Z] [debug] ||| post fail hooks runtime: 0 s
The test run was conducted as "job 7" as part of openQA's fullstack test (with caching enabled). I suppose it should also be reproducible in other "fullstack scenarios" where we expect the test to pass. Maybe os-autoinst's own fullstack test is not affected because it runs a few more test modules before shutdown
. The change https://github.com/os-autoinst/os-autoinst/commit/a7103b16 was supposed to fix that and it definitely improved the situation. However, it seems in very rare cases¹ the test can still fail. (The fix was definitely used here as one can see in the logs that the assert_screen
module is successfully executed.)
¹ I ran openQA's fullstack test about 75 times in CircleCI until I hit an occurrence.
Acceptance criteria¶
- AC1: openQA fullstack test runs 100 times without failing
Suggestions¶
- Not Ondrej's fault
- Make it fail on purpose... but we don't know the exact cause, only the symptom
- Reveal the video from the pool directory after reproducing locally
Out of scope¶
- Reveal the video from the pool directory in CI
Files