action #102539: Non-fatal test modules aborting the whole job size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #102539

closed

Non-fatal test modules aborting the whole job size:M

Added by pdostal over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Support

Target version:

Ready

Start date:

2021-11-16

Due date:

2021-12-06

% Done:

Estimated time:

Description

Hello,

we're starting to encounter jobs where a failed test module aborts the whole run.
At first, we thought this is specific to ssh-tunnel (f.e. test1) so we tried fix1 and fix2.
Today we found that also s390-kvm tests (f.e. test2) are affected. We still don't have any solution to this.

Can you help us to hunt down and fix this issue?

Suggestions¶

Ask for context what the tests do and what the suggested fixes might do
Identify a regression or file a new ticket as needed

Actions

Copy link

Updated by okurz over 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Support
Target version set to Ready

Actions

Copy link

Updated by livdywan over 3 years ago

Subject changed from Non-fatal test modules aborting the whole job to Non-fatal test modules aborting the whole job size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by mkittler over 3 years ago

Assignee set to mkittler

The first attempted fix (setting QEMU_DISABLE_SNAPSHOTS) shouldn't have any effect because the test is using the svirt backend. To me it also isn't clear whether snapshotting is actually the cause of the failure here. It is a backend feature and a problem there would normally result in an incomplete job (and not a failed one).

Actions

Copy link

Updated by mkittler over 3 years ago

The execution time isn't recorded in these cases as well. However, that's apparently also the case when further modules are processed, e.g. https://openqa.suse.de/tests/7717217. In this example the failure is also very similar:

# Test died: command 'docker-compose pull' failed at /usr/lib/os-autoinst/testapi.pm line 950.
    testapi::_handle_script_run_ret(127, "docker-compose pull", "quiet", undef, "timeout", 600, "fail_message", "") called at /usr/lib/os-autoinst/testapi.pm line 988
    testapi::assert_script_run("docker-compose pull", 600) called at sle/tests/containers/docker_compose.pm line 55
    docker_compose::run(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658), CODE(0x5624ceea2870)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

So the kind of the error doesn't make this case special.

Just for the record, in the problematic tests where no further modules are executed the failure is:

# Test died: command 'docker system prune -a -f' timed out at /usr/lib/os-autoinst/testapi.pm line 947.
    testapi::_handle_script_run_ret(undef, "docker system prune -a -f", "timeout", 300, "fail_message", "", "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 988
    testapi::assert_script_run("docker system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 26
    containers::engine::_engine_assert_script_run(containers::docker=HASH(0x100335702d0), "system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 286
    containers::engine::cleanup_system_host(containers::docker=HASH(0x100335702d0)) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/tests/containers/docker_image.pm line 47
    docker_image::run(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8), CODE(0x10032fb5960)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

The second problematic test looks different, though:

# Test died: zypper inside container timed out at sle/lib/containers/utils.pm line 155.
    containers::utils::basic_container_tests("runtime", "podman") called at sle/tests/containers/podman.pm line 38
    podman::run(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/basetest.pm line 344
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
    basetest::runtest(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/autotest.pm line 368
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0), CODE(0x5634bd1fbd78)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/os-autoinst/autotest.pm line 294
    autotest::start_process() called at /usr/bin/isotovideo line 260

In any case, it is basically just an unhandled croak/die because a command runs into a timeout.

Actions

Copy link

Updated by mkittler over 3 years ago

Status changed from Workable to Feedback

The difference between the "good" case where further modules are executed (https://openqa.suse.de/tests/7717217) and the "bad" case where further modules are skipped (https://openqa.suse.de/tests/7687832) is whether snapshotting is enabled (or supported by the backend). In the "bad" case snapshotting is disabled via QEMU_DISABLE_SNAPSHOTS=1 (that's likely due to "fix1" mentioned in the ticket description). So the "bad" case runs into this if-block in autotest::runalltests:

            if ($t->{fatal_failure} || $flags->{fatal} || (!exists $flags->{fatal} && !$snapshots_supported) || $bmwqemu::vars{TESTDEBUG}) {
                bmwqemu::stop_vm();
                return 0;
            }

This condition has the effect that os-autoinst always stops executing further test modules if a "test died" and snapshotting is not supported (regardless of the test flags) unless one adds the test flag fatal => 0 explicitly. The part after the "unless" has been implemented some time ago by me for #58166, see https://github.com/os-autoinst/os-autoinst/commit/9d81cdb8e710f05f1890d224fc6c3af98a876de4 for details. So if you have the same use-case than #58166 (test should continue despite the lack of snapshotting) you can simply add fatal => 0 to the test flags.

Note that svirt does not generally support snapshots. It only supports snapshots when VIRSH_VMM_FAMILY is kvm, hyperv or vmware. Hence the other "bad" case (https://openqa.suse.de/tests/7690041) ran into the problem as well.

Actions

Copy link

Updated by mkittler over 3 years ago

Looks like I've even documented the behavior last time I came across this topic: https://open.qa/docs/#_how_to_write_tests

To be honest, I only remembered after looking at the if-condition within the code. So regardless of the behavior being documented or not it would make sense to add a log message which states the reason for the early test exit.

Actions

Copy link