action #102539
closedNon-fatal test modules aborting the whole job size:M
Description
Hello,
we're starting to encounter jobs where a failed test module aborts the whole run.
At first, we thought this is specific to ssh-tunnel
(f.e. test1) so we tried fix1 and fix2.
Today we found that also s390-kvm
tests (f.e. test2) are affected. We still don't have any solution to this.
Can you help us to hunt down and fix this issue?
Suggestions¶
- Ask for context what the tests do and what the suggested fixes might do
- Identify a regression or file a new ticket as needed
Updated by okurz almost 3 years ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to Support
- Target version set to Ready
Updated by livdywan almost 3 years ago
- Subject changed from Non-fatal test modules aborting the whole job to Non-fatal test modules aborting the whole job size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 3 years ago
- Assignee set to mkittler
The first attempted fix (setting QEMU_DISABLE_SNAPSHOTS
) shouldn't have any effect because the test is using the svirt
backend. To me it also isn't clear whether snapshotting is actually the cause of the failure here. It is a backend feature and a problem there would normally result in an incomplete job (and not a failed one).
Updated by mkittler almost 3 years ago
The execution time isn't recorded in these cases as well. However, that's apparently also the case when further modules are processed, e.g. https://openqa.suse.de/tests/7717217. In this example the failure is also very similar:
# Test died: command 'docker-compose pull' failed at /usr/lib/os-autoinst/testapi.pm line 950.
testapi::_handle_script_run_ret(127, "docker-compose pull", "quiet", undef, "timeout", 600, "fail_message", "") called at /usr/lib/os-autoinst/testapi.pm line 988
testapi::assert_script_run("docker-compose pull", 600) called at sle/tests/containers/docker_compose.pm line 55
docker_compose::run(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658), CODE(0x5624ceea2870)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
So the kind of the error doesn't make this case special.
Just for the record, in the problematic tests where no further modules are executed the failure is:
# Test died: command 'docker system prune -a -f' timed out at /usr/lib/os-autoinst/testapi.pm line 947.
testapi::_handle_script_run_ret(undef, "docker system prune -a -f", "timeout", 300, "fail_message", "", "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 988
testapi::assert_script_run("docker system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 26
containers::engine::_engine_assert_script_run(containers::docker=HASH(0x100335702d0), "system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 286
containers::engine::cleanup_system_host(containers::docker=HASH(0x100335702d0)) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/tests/containers/docker_image.pm line 47
docker_image::run(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8), CODE(0x10032fb5960)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
The second problematic test looks different, though:
# Test died: zypper inside container timed out at sle/lib/containers/utils.pm line 155.
containers::utils::basic_container_tests("runtime", "podman") called at sle/tests/containers/podman.pm line 38
podman::run(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0), CODE(0x5634bd1fbd78)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
In any case, it is basically just an unhandled croak
/die
because a command runs into a timeout.
Updated by mkittler almost 3 years ago
- Status changed from Workable to Feedback
The difference between the "good" case where further modules are executed (https://openqa.suse.de/tests/7717217) and the "bad" case where further modules are skipped (https://openqa.suse.de/tests/7687832) is whether snapshotting is enabled (or supported by the backend). In the "bad" case snapshotting is disabled via QEMU_DISABLE_SNAPSHOTS=1
(that's likely due to "fix1" mentioned in the ticket description). So the "bad" case runs into this if-block in autotest::runalltests
:
if ($t->{fatal_failure} || $flags->{fatal} || (!exists $flags->{fatal} && !$snapshots_supported) || $bmwqemu::vars{TESTDEBUG}) {
bmwqemu::stop_vm();
return 0;
}
This condition has the effect that os-autoinst always stops executing further test modules if a "test died" and snapshotting is not supported (regardless of the test flags) unless one adds the test flag fatal => 0
explicitly. The part after the "unless" has been implemented some time ago by me for #58166, see https://github.com/os-autoinst/os-autoinst/commit/9d81cdb8e710f05f1890d224fc6c3af98a876de4 for details. So if you have the same use-case than #58166 (test should continue despite the lack of snapshotting) you can simply add fatal => 0
to the test flags.
Note that svirt does not generally support snapshots. It only supports snapshots when VIRSH_VMM_FAMILY
is kvm
, hyperv
or vmware
. Hence the other "bad" case (https://openqa.suse.de/tests/7690041) ran into the problem as well.
Updated by mkittler almost 3 years ago
Looks like I've even documented the behavior last time I came across this topic: https://open.qa/docs/#_how_to_write_tests
To be honest, I only remembered after looking at the if-condition within the code. So regardless of the behavior being documented or not it would make sense to add a log message which states the reason for the early test exit.
Updated by mkittler almost 3 years ago
PR for adding a log message: https://github.com/os-autoinst/os-autoinst/pull/1859
Updated by mkittler almost 3 years ago
The PR to improve the logging has been merged.
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
I've got no further feedback but I assume the improved log message is good enough.