https://progress.opensuse.org/https://progress.opensuse.org/themes/openSUSE/favicon/favicon.ico?15829177842021-11-17T06:24:23ZopenSUSE Project Management ToolopenQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4652122021-11-17T06:24:23Zokurzokurz@suse.com
<ul><li><strong>Project</strong> changed from <i>openQA Infrastructure</i> to <i>openQA Project</i></li><li><strong>Category</strong> set to <i>Support</i></li><li><strong>Target version</strong> set to <i>Ready</i></li></ul> openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4657522021-11-18T10:43:32Zlivdywanliv.dywan@suse.com
<ul><li><strong>Subject</strong> changed from <i>Non-fatal test modules aborting the whole job</i> to <i>Non-fatal test modules aborting the whole job size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/465752/diff?detail_id=440459">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li></ul> openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4665692021-11-22T15:54:57Zmkittlermarius.kittler@suse.com
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul><p>The first attempted fix (setting <code>QEMU_DISABLE_SNAPSHOTS</code>) shouldn't have any effect because the test is using the <code>svirt</code> backend. To me it also isn't clear whether snapshotting is actually the cause of the failure here. It is a backend feature and a problem there would <em>normally</em> result in an incomplete job (and not a failed one).</p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4665812021-11-22T16:32:20Zmkittlermarius.kittler@suse.com
<ul></ul><p>The execution time isn't recorded in these cases as well. However, that's apparently also the case when further modules are processed, e.g. <a href="https://openqa.suse.de/tests/7717217">https://openqa.suse.de/tests/7717217</a>. In this example the failure is also <em>very</em> similar:</p>
<pre><code># Test died: command 'docker-compose pull' failed at /usr/lib/os-autoinst/testapi.pm line 950.
testapi::_handle_script_run_ret(127, "docker-compose pull", "quiet", undef, "timeout", 600, "fail_message", "") called at /usr/lib/os-autoinst/testapi.pm line 988
testapi::assert_script_run("docker-compose pull", 600) called at sle/tests/containers/docker_compose.pm line 55
docker_compose::run(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(docker_compose=HASH(0x5624cd70b1d8)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658), CODE(0x5624ceea2870)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5624cf81c658)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
</code></pre>
<p>So the kind of the error doesn't make this case special.</p>
<hr>
<p>Just for the record, in the problematic tests where no further modules are executed the failure is:</p>
<pre><code># Test died: command 'docker system prune -a -f' timed out at /usr/lib/os-autoinst/testapi.pm line 947.
testapi::_handle_script_run_ret(undef, "docker system prune -a -f", "timeout", 300, "fail_message", "", "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 988
testapi::assert_script_run("docker system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 26
containers::engine::_engine_assert_script_run(containers::docker=HASH(0x100335702d0), "system prune -a -f", 300) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/lib/containers/engine.pm line 286
containers::engine::cleanup_system_host(containers::docker=HASH(0x100335702d0)) called at /var/lib/openqa/pool/36/os-autoinst-distri-opensuse/tests/containers/docker_image.pm line 47
docker_image::run(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(docker_image=HASH(0x10031fb1530)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8), CODE(0x10032fb5960)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x100332791a8)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
</code></pre>
<p>The second problematic test looks different, though:</p>
<pre><code># Test died: zypper inside container timed out at sle/lib/containers/utils.pm line 155.
containers::utils::basic_container_tests("runtime", "podman") called at sle/tests/containers/podman.pm line 38
podman::run(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/basetest.pm line 344
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 338
basetest::runtest(podman=HASH(0x5634bc65c358)) called at /usr/lib/os-autoinst/autotest.pm line 368
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 368
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 236
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 236
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 292
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0), CODE(0x5634bd1fbd78)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 477
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5634bcc897f0)) called at /usr/lib/os-autoinst/autotest.pm line 294
autotest::start_process() called at /usr/bin/isotovideo line 260
</code></pre>
<p>In any case, it is basically just an unhandled <code>croak</code>/<code>die</code> because a command runs into a timeout.</p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4669682021-11-23T17:07:48Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>Feedback</i></li></ul><p>The difference between the "good" case where further modules are executed (<a href="https://openqa.suse.de/tests/7717217" class="external">https://openqa.suse.de/tests/7717217</a>) and the "bad" case where further modules are skipped (<a href="https://openqa.suse.de/tests/7687832" class="external">https://openqa.suse.de/tests/7687832</a>) is whether snapshotting is enabled (or supported by the backend). In the "bad" case snapshotting is disabled via <code>QEMU_DISABLE_SNAPSHOTS=1</code> (that's likely due to "fix1" mentioned in the ticket description). So the "bad" case runs into this if-block in <code>autotest::runalltests</code>:</p>
<pre><code> if ($t->{fatal_failure} || $flags->{fatal} || (!exists $flags->{fatal} && !$snapshots_supported) || $bmwqemu::vars{TESTDEBUG}) {
bmwqemu::stop_vm();
return 0;
}
</code></pre>
<p>This condition has the effect that os-autoinst <strong>always</strong> stops executing further test modules if a "test died" and snapshotting is <em>not</em> supported (regardless of the test flags) unless one adds the test flag <code>fatal => 0</code> explicitly. The part after the "unless" has been implemented some time ago by me for <a class="issue tracker-6 status-3 priority-4 priority-default closed" title="coordination: EPIC: Continue tests after failures on !qemu (Resolved)" href="https://progress.opensuse.org/issues/58166">#58166</a>, see <a href="https://github.com/os-autoinst/os-autoinst/commit/9d81cdb8e710f05f1890d224fc6c3af98a876de4" class="external">https://github.com/os-autoinst/os-autoinst/commit/9d81cdb8e710f05f1890d224fc6c3af98a876de4</a> for details. So if you have the same use-case than <a class="issue tracker-6 status-3 priority-4 priority-default closed" title="coordination: EPIC: Continue tests after failures on !qemu (Resolved)" href="https://progress.opensuse.org/issues/58166">#58166</a> (test should continue despite the lack of snapshotting) you can simply add <code>fatal => 0</code> to the test flags.</p>
<p>Note that svirt does not generally support snapshots. It only supports snapshots when <code>VIRSH_VMM_FAMILY</code> is <code>kvm</code>, <code>hyperv</code> or <code>vmware</code>. Hence the other "bad" case (<a href="https://openqa.suse.de/tests/7690041" class="external">https://openqa.suse.de/tests/7690041</a>) ran into the problem as well.</p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4669712021-11-23T17:10:25Zmkittlermarius.kittler@suse.com
<ul></ul><p>Looks like I've even documented the behavior last time I came across this topic: <a href="https://open.qa/docs/#_how_to_write_tests" class="external">https://open.qa/docs/#_how_to_write_tests</a></p>
<p>To be honest, I only remembered after looking at the if-condition within the code. So regardless of the behavior being documented or not it would make sense to add a log message which states the reason for the early test exit.</p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4669742021-11-23T17:24:20Zmkittlermarius.kittler@suse.com
<ul></ul><p>PR for adding a log message: <a href="https://github.com/os-autoinst/os-autoinst/pull/1859" class="external">https://github.com/os-autoinst/os-autoinst/pull/1859</a></p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4675172021-11-24T12:15:48Zokurzokurz@suse.com
<ul><li><strong>Due date</strong> set to <i>2021-12-06</i></li></ul> openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4679702021-11-25T15:21:12Zmkittlermarius.kittler@suse.com
<ul></ul><p>The PR to improve the logging has been merged.</p>
openQA Project - action #102539: Non-fatal test modules aborting the whole job size:Mhttps://progress.opensuse.org/issues/102539?journal_id=4692572021-11-30T14:33:19Zmkittlermarius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>I've got no further feedback but I assume the improved log message is good enough.</p>