Project

General

Profile

action #102539

Non-fatal test modules aborting the whole job size:M

Added by pdostal 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Support
Target version:
Start date:
2021-11-16
Due date:
2021-12-06
% Done:

0%

Estimated time:
Difficulty:

Description

Hello,

we're starting to encounter jobs where a failed test module aborts the whole run.
At first, we thought this is specific to ssh-tunnel (f.e. test1) so we tried fix1 and fix2.
Today we found that also s390-kvm tests (f.e. test2) are affected. We still don't have any solution to this.

Can you help us to hunt down and fix this issue?

Suggestions

  • Ask for context what the tests do and what the suggested fixes might do
  • Identify a regression or file a new ticket as needed

History

#1 Updated by okurz 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Support
  • Target version set to Ready

#2 Updated by cdywan 2 months ago

  • Subject changed from Non-fatal test modules aborting the whole job to Non-fatal test modules aborting the whole job size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 2 months ago

  • Assignee set to mkittler

The first attempted fix (setting QEMU_DISABLE_SNAPSHOTS) shouldn't have any effect because the test is using the svirt backend. To me it also isn't clear whether snapshotting is actually the cause of the failure here. It is a backend feature and a problem there would normally result in an incomplete job (and not a failed one).

#4 Updated by mkittler 2 months ago

The execution time isn't recorded in these cases as well. However, that's apparently also the case when further modules are processed, e.g. https://openqa.suse.de/tests/7717217. In this example the failure is also very similar:

# Test died: command 'docker-compose pull' failed at /usr/lib/os-autoinst/testapi.pm line 950.
    testapi::_handle_script_run_ret(127, "docker-compose pull", "quiet", undef, "timeout", 600, "fail_message", "") called at /usr/lib/os-autoinst/testapi.pm line 988
    testapi::assert_script_run("docker-compose pull", 600) called at sle/tests/containers/docker_compose.pm line 55
    docker_compose::run(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658), CODE(0x5624ceea2870)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

So the kind of the error doesn't make this case special.


Just for the record, in the problematic tests where no further modules are executed the failure is:

# Test died: command 'docker system prune -a -f' timed out at /usr/lib/os-autoinst/testapi.pm line 947.
    testapi::_handle_script_run_ret(undef, "docker system prune -a -f", "timeout", 300, "fail_message", "", "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 988
    testapi::assert_script_run("docker system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 26
    containers::engine::_engine_assert_script_run(containers::docker=HASH(0x100335702d0), "system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 286
    containers::engine::cleanup_system_host(containers::docker=HASH(0x100335702d0)) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/tests/containers/docker_image.pm line 47
    docker_image::run(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8), CODE(0x10032fb5960)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

The second problematic test looks different, though:

# Test died: zypper inside container timed out at sle/lib/containers/utils.pm line 155.
    containers::utils::basic_container_tests("runtime", "podman") called at sle/tests/containers/podman.pm line 38
    podman::run(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0), CODE(0x5634bd1fbd78)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

In any case, it is basically just an unhandled croak/die because a command runs into a timeout.

#5 Updated by mkittler 2 months ago

  • Status changed from Workable to Feedback

The difference between the "good" case where further modules are executed (https://openqa.suse.de/tests/7717217) and the "bad" case where further modules are skipped (https://openqa.suse.de/tests/7687832) is whether snapshotting is enabled (or supported by the backend). In the "bad" case snapshotting is disabled via QEMU_DISABLE_SNAPSHOTS=1 (that's likely due to "fix1" mentioned in the ticket description). So the "bad" case runs into this if-block in autotest::runalltests:

            if ($t->{fatal_failure} || $flags->{fatal} || (!exists $flags->{fatal} && !$snapshots_supported) || $bmwqemu::vars{TESTDEBUG}) {
                bmwqemu::stop_vm();
                return 0;
            }

This condition has the effect that os-autoinst always stops executing further test modules if a "test died" and snapshotting is not supported (regardless of the test flags) unless one adds the test flag fatal => 0 explicitly. The part after the "unless" has been implemented some time ago by me for #58166, see https://github.com/os-autoinst/os-autoinst/commit/9d81cdb8e710f05f1890d224fc6c3af98a876de4 for details. So if you have the same use-case than #58166 (test should continue despite the lack of snapshotting) you can simply add fatal => 0 to the test flags.

Note that svirt does not generally support snapshots. It only supports snapshots when VIRSH_VMM_FAMILY is kvm, hyperv or vmware. Hence the other "bad" case (https://openqa.suse.de/tests/7690041) ran into the problem as well.

#6 Updated by mkittler 2 months ago

Looks like I've even documented the behavior last time I came across this topic: https://open.qa/docs/#_how_to_write_tests

To be honest, I only remembered after looking at the if-condition within the code. So regardless of the behavior being documented or not it would make sense to add a log message which states the reason for the early test exit.

#7 Updated by mkittler 2 months ago

#8 Updated by okurz 2 months ago

  • Due date set to 2021-12-06

#9 Updated by mkittler 2 months ago

The PR to improve the logging has been merged.

#10 Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

I've got no further feedback but I assume the improved log message is good enough.

Also available in: Atom PDF