action #105429
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors
openQA's fullstack test fails in `shutdown` module
Description
Observation¶
See https://github.com/os-autoinst/openQA/pull/4468#issuecomment-1019201379, the error is:
# [2022-01-22T08:07:01.494359Z] [debug] [pid:488] REST-API call: POST http://localhost:9526/api/v1/jobs/1/status
# [2022-01-22T08:07:01.528597Z] [info] [pid:831] Uploading video.ogv
# [2022-01-22T08:07:01.529126Z] [debug] [pid:831] Uploading artefact video.ogv
# [2022-01-22T08:07:01.566264Z] [info] [pid:831] Uploading vars.json
# [2022-01-22T08:07:01.566529Z] [debug] [pid:831] Uploading artefact vars.json
# [2022-01-22T08:07:01.594899Z] [info] [pid:831] Uploading autoinst-log.txt
# [2022-01-22T08:07:01.595178Z] [debug] [pid:831] Uploading artefact autoinst-log.txt
# [2022-01-22T08:07:01.621786Z] [info] [pid:831] Uploading worker-log.txt
# [2022-01-22T08:07:01.622061Z] [debug] [pid:831] Uploading artefact worker-log.txt
Bailout called. Further testing stopped: Job 1 produced the wrong results
# Failed test 'no (unexpected) warnings (via END block)'
# at /usr/lib/perl5/5.26.1/Test/Builder.pm line 135.
# Got the following unexpected warnings:
# 1: Use of uninitialized value in subroutine entry at t/full-stack.t line 213.
The Use of uninitialized value…
is basically because the qcow image hasn't been uploaded. Those are the corresponding failing checks:
not ok 8 - local upload feature used
not ok 9 - image of hdd uploaded
not ok 10 - can stat /tmp/IWUDY9BiMA/full-stack.d/openqa/share/factory/hdd/core-hdd.qcow2
not ok 11 - exported image has correct permissions (420 -> 0644)
And that is simply because the test failed:
not ok 2 - Expected result not found
not ok 3 - job 1 passed
# job info: {
# 'assigned_worker_id' => 1,
# 'blocked_by_id' => undef,
# 'clone_id' => undef,
# 'group_id' => undef,
# 'id' => 1,
# 'name' => 'tinycore-1-flavor-i386-Build1-core@coolone',
# 'priority' => 50,
# 'result' => 'failed',
# 'settings' => {
# 'ARCH' => 'i386',
# 'BUILD' => '1',
# 'DISTRI' => 'tinycore',
# 'FLAVOR' => 'flavor',
# 'INTEGRATION_TESTS' => '1',
# 'ISO' => 'Core-7.2.iso',
# 'MACHINE' => 'coolone',
# 'NAME' => '00000001-tinycore-1-flavor-i386-Build1-core@coolone',
# 'PAUSE_AT' => 'shutdown',
# 'PUBLISH_HDD_1' => 'core-hdd.qcow2',
# 'QEMU' => 'i386',
# 'QEMU_NO_FDC_SET' => '1',
# 'QEMU_NO_KVM' => '1',
# 'QEMU_NO_TABLET' => '1',
# 'TEST' => 'core',
# 'UEFI_PFLASH_VARS' => '/usr/share/qemu/ovmf-x86_64.bin',
# 'VERSION' => '1',
# 'WORKER_CLASS' => 'qemu_i386'
# },
# 'state' => 'done',
# 't_finished' => '2022-01-22T08:07:02',
# 't_started' => '2022-01-22T08:05:09',
# 'test' => 'core'
# }
ok 4 - autoinst log file generated
ok 5 - worker log file generated
ok 6 - autoinst log uploaded
ok 7 - worker log uploaded
It fails in the shutdown
module:
[0m[1;34m[2022-01-22T08:05:26.126716Z] [debug] ||| finished boot tests (runtime: 12 s)
[0m[1;34m[2022-01-22T08:05:26.127797Z] [debug] ||| starting shutdown tests/shutdown.pm
[0m[37m[2022-01-22T08:05:26.128265Z] [debug] isotovideo: pausing test execution of shutdown because we're supposed to pause at this test module
[0m[2022-01-22T08:05:26.128630Z] [debug] tests/shutdown.pm:8 called testapi::type_string
[2022-01-22T08:05:26.128802Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[37m[2022-01-22T08:05:26.129110Z] [debug] isotovideo: paused, so not passing backend_type_string to backend
[0m[37m[2022-01-22T08:05:28.330185Z] [debug] isotovideo: test execution will be resumed
[0m[37m[2022-01-22T08:05:28.330596Z] [debug] isotovideo: resuming, continue passing backend_type_string to backend
[0m[2022-01-22T08:05:28.601320Z] [debug] tests/shutdown.pm:9 called testapi::type_string
[2022-01-22T08:05:28.601518Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-22T08:05:28.910444Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
[2022-01-22T08:05:28.910592Z] [debug] <<< testapi::check_shutdown(timeout=90)
[37m[2022-01-22T08:05:28.911472Z] [debug] QEMU status is not 'shutdown', it is 'running'
[0m[37m[2022-01-22T08:05:29.912649Z] [debug] QEMU status is not 'shutdown', it is 'running'
[0m[37m[2022-01-22T08:05:30.913806Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":996204,"seconds":1642838729}}
[0m[37m[2022-01-22T08:05:30.914058Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
[0m[37m[2022-01-22T08:06:58.035515Z] [debug] QEMU status is not 'shutdown', it is 'running'
[0m[37m[2022-01-22T08:06:59.036839Z] [debug] QEMU status is not 'shutdown', it is 'running'
[0m[33m[2022-01-22T08:07:00.097072Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.
[0m[1;34m[2022-01-22T08:07:00.099824Z] [debug] ||| finished shutdown tests (runtime: 94 s)
[0m[1;34m[2022-01-22T08:07:00.099889Z] [debug] ||| post fail hooks runtime: 0 s
[0m[37m[2022-01-22T08:07:00.100157Z] [debug] stopping overall test execution after a fatal test failure
Likely the SUT isn't aware that it should shutdown because one of the type_string
commands doesn't work as expected. That might have something to do with the fact that we pause/resume the test execution before the shutdown
module but could also be a general typing issue.
Acceptance criteria¶
- AC1: Retry for full stack test is removed again
Suggestion¶
- Fix #104971 first
- Have a look at the code (e.g. code for handling pausing/resuming in os-autoinst) for possible problems (e.g. lack of synchronization)
- Find a way to reproduce the problem in a unit test
- Workaround the issue by combining
type_string
commands in the test'sshutdown
module - Remove the retry again on Makefile
Updated by livdywan almost 3 years ago
Updated by mkittler almost 3 years ago
- Assignee deleted (
mkittler)
That test failure might be related, indeed. I'm currently having already #104971 so I'm unassigning. (One ticket of that kind is enough. If nobody assigns here soon I'll take it back.)
Updated by mkittler almost 3 years ago
Notes from the meeting:
- The mentioned job (https://openqa.opensuse.org/tests/2157847#step/shutdown/2) fails likely because it has to switch from the graphical environment to the text console which can take a while. There's no good synchronization in the test (only
wait_screen_change
) so it isn't completely surprising that the machine might not be ready. However, our tests (the openQA fullstack test) does not have to switch consoles. Hence, it is likely a different issue. - We could try system_powerdown instead of typing
poweroff
. This would also help us getting fullstack test coverage for that feature. - Note that 2. would potentially still mask a problem in the command processing when the developer mode is used to pause/resume tests. So it might still make sense to follow my suggestions from the ticket description to have a look at it.
- One could try to reproduce the issue within os-autoinst's fullstack test (by setting the test variable
INTEGRATION_TESTS
to avoid being blocked by #104971).
Updated by okurz almost 3 years ago
mkittler wrote:
Notes from the meeting:
- The mentioned job (https://openqa.opensuse.org/tests/2157847#step/shutdown/2) fails likely because it has to switch from the graphical environment to the text console which can take a while. There's no good synchronization in the test (only
wait_screen_change
) so it isn't completely surprising that the machine might not be ready. However, our tests (the openQA fullstack test) does not have to switch consoles. Hence, it is likely a different issue.- We could try system_powerdown instead of typing
poweroff
. This would also help us getting fullstack test coverage for that feature.- Note that 2. would potentially still mask a problem in the command processing when the developer mode is used to pause/resume tests. So it might still make sense to follow my suggestions from the ticket description to have a look at it.
I tried with a patch
--- a/t/data/tests/tests/shutdown.pm
+++ b/t/data/tests/tests/shutdown.pm
@@ -5,8 +5,7 @@ use Mojo::Base 'basetest', -signatures;
use testapi;
sub run {
- type_string "sudo su\n";
- type_string "poweroff\n";
+ power 'off';
assert_shutdown(get_var('INTEGRATION_TESTS') ? 90 : undef);
}
but this made the full stack test fail with
# [2022-01-26T14:06:27.097610+01:00] [debug] ||| starting shutdown tests/shutdown.pm
# [2022-01-26T14:06:27.099532+01:00] [debug] tests/shutdown.pm:8 called testapi::power
# [2022-01-26T14:06:27.099805+01:00] [debug] <<< testapi::power(action="off")
# [2022-01-26T14:06:27.101367+01:00] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":297928,"seconds":1643202242}}
# [2022-01-26T14:06:27.102545+01:00] [debug] tests/shutdown.pm:9 called testapi::assert_shutdown
# [2022-01-26T14:06:27.102821+01:00] [debug] <<< testapi::check_shutdown(timeout=60)
# [2022-01-26T14:06:27.104357+01:00] [debug] EVENT {"data":{"guest":false,"reason":"host-qmp-quit"},"event":"SHUTDOWN","timestamp":{"microseconds":101650,"seconds":1643202387}}
# dmesg: read kernel buffer failed: Operation not permitted
# [2022-01-26T14:06:27.297549+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
# [2022-01-26T14:06:27.299926+01:00] [debug] sending magic and exit
# Unexpected end of data 0
# [2022-01-26T14:06:27.311086+01:00] [debug] backend process exited: 0
# [2022-01-26T14:06:27.311343+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 11622 and exit status: 1
# [2022-01-26T14:06:27.331434+01:00] [debug] stopping command server 11591 because test execution ended
# [2022-01-26T14:06:27.331546+01:00] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/xyHXbClMTX/broadcast
# [2022-01-26T14:06:27.917100+01:00] [debug] commands process exited: 0
# [2022-01-26T14:06:28.018790+01:00] [debug] done with command server
# Use of uninitialized value $serial in split at /home/okurz/local/os-autoinst/os-autoinst/t/../basetest.pm line 705.
# [2022-01-26T14:06:28.019081+01:00] [debug] stopping autotest process 11594
# [2022-01-26T14:06:28.019540+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
# [2022-01-26T14:06:28.029389+01:00] [debug] [autotest] process exited: 1
# [2022-01-26T14:06:28.130022+01:00] [debug] done with autotest process
# [2022-01-26T14:06:28.130247+01:00] [debug] isotovideo failed
# [2022-01-26T14:06:28.131252+01:00] [debug] stopping backend process 11608
# [2022-01-26T14:06:28.131403+01:00] [debug] done with backend process
# 11586: EXIT 1
# '
# doesn't match '(?^u:\d*: EXIT 0)'
# Failed test 'Result in testresults/result-shutdown.json is ok'
# at t/99-full-stack.t line 68.
# got: 'canceled'
# expected: 'ok'
# Looks like you failed 2 tests of 25.
t/99-full-stack.t .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/25 subtests
Test Summary Report
-------------------
t/99-full-stack.t (Wstat: 512 Tests: 25 Failed: 2)
Failed tests: 1, 19
Non-zero exit status: 2
Files=1, Tests=25, 155 wallclock secs ( 0.08 usr 0.02 sys + 42.57 cusr 4.93 csys = 47.60 CPU)
Result: FAIL
I don't understand right now what this means.
- One could try to reproduce the issue within os-autoinst's fullstack test (by setting the test variable
INTEGRATION_TESTS
to avoid being blocked by #104971).
I could so far not reproduce the issue with cd ~/local/os-autoinst/os-autoinst/t/data/tests && runs=100 count_fail_ratio isotovideo -d INTEGRATION_TESTS=1 CASEDIR=.)
after 40 runs.
Updated by mkittler almost 3 years ago
power 'off'
is invoking QEMU's quit command and thus causing QEMU to terminate completely. Therefore the further queries for the VMs state aren't working anymore. Considering how the QEMU backend is implemented I suppose power 'off'
is not generally useful. I tried power 'acpi'
instead which is using system_powerdown. Annoyingly the Tinycore system doesn't react to it at all.
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Feedback
This PR should fix the problem (see commit message for details): https://github.com/os-autoinst/os-autoinst/pull/1945
I'm actually quite sure that this will work so I won't investigate a potential problem with the command handling at this point (which I can still do if it turns out to be still broken).
I'll revert https://github.com/os-autoinst/openQA/pull/4474 when https://github.com/os-autoinst/os-autoinst/pull/1945 has been merged and the os-autoinst version in openQA's CI has been updated. I can also rebase/reopen https://github.com/os-autoinst/openQA/pull/4460 to do some further testing.
Note that I couldn't reproduce the problem locally so far. That's likely because TinyCore is just booting too fast on my system (and even in the CI the cases where it is slow enough seem to be rare).
Updated by mkittler almost 3 years ago
I conducted over 50 runs in https://github.com/os-autoinst/openQA/pull/4496 (where I forced the new os-autoinst version) and couldn't reproduce the issue anymore. I found one occurrence of another failure (see epic comment). I'll consider the ticket resolved when the new os-autoinst version is updated in openQA's CI.
Updated by okurz almost 3 years ago
Seems like we have regressions on other backends though, is the VNC implementation used for IPMI backend as well?
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
The regression has been handled in the other ticket.
Note that none of the changes are in openQA's CI yet (see https://github.com/os-autoinst/openQA/blob/master/tools/ci/autoinst.sha). There are other problems with the fullstack test (see parent ticket) so I cannot remove the retry as conclusion of this ticket. Hence I'm resolving it now regardless as there's nothing more to do (regarding this particular problem).
Updated by mkittler almost 3 years ago
By the way, the regression has nothing to do with changes I made for this ticket (https://github.com/os-autoinst/os-autoinst/pull/1945) so I'm removing the link.