action #71845: tests fail in circleci with "Non-zero wait status: <non-negative number>" but "All <X> subtests passed" - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #71845

closed

tests fail in circleci with "Non-zero wait status: <non-negative number>" but "All <X> subtests passed"

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2020-09-24

Due date:

% Done:

Estimated time:

Description

Observation¶

For example
https://app.circleci.com/pipelines/github/os-autoinst/openQA/4328/workflows/2380a4fe-7ea3-4f46-b0ad-99f9aacb9e7e/jobs/41532/steps
shows:

[08:43:42] t/api/09-comments.t ................      All 7 subtests passed 
[08:43:58] t/api/10-jobgroups.t ............... ok    17566 ms ( 0.17 usr  0.01 sys + 15.13 cusr  0.81 csys = 16.12 CPU)
[08:44:16] t/api/11-bugs.t .................... ok    14348 ms ( 0.03 usr  0.00 sys + 12.18 cusr  1.48 csys = 13.69 CPU)
[08:44:30] t/api/12-admin-workers.t ........... ok    14216 ms ( 0.00 usr  0.00 sys + 11.71 cusr  1.83 csys = 13.54 CPU)
[08:44:44] t/api/13-influxdb.t ................ ok    13538 ms ( 0.01 usr  0.00 sys + 11.65 cusr  1.26 csys = 12.92 CPU)
[08:44:58] t/api/14-plugin_obs_rsync_async.t .. ok    46127 ms ( 0.07 usr  0.00 sys + 16.40 cusr  2.93 csys = 19.40 CPU)
[08:45:44] t/api/14-plugin_obs_rsync.t ........ ok    97631 ms ( 0.10 usr  0.01 sys + 81.41 cusr 10.57 csys = 92.09 CPU)
[08:47:22] t/api/15-search.t .................. ok    13592 ms ( 0.02 usr  0.00 sys + 11.22 cusr  1.63 csys = 12.87 CPU)
[08:47:35] t/api/15-users.t ................... ok    12358 ms ( 0.01 usr  0.00 sys + 10.75 cusr  1.02 csys = 11.78 CPU)
[08:47:48]

Test Summary Report
-------------------
t/api/09-comments.t              (Wstat: 14 Tests: 7 Failed: 0)
  Non-zero wait status: 14

t/api/09-comments.t reported "All 7 subtests passed" and also the "tests" tab in circleCI does show that all tests passed and the artifact file, e.g. https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6c5695c60fc519f3c9f54a-0-build/artifacts/artifacts/api/09-comments.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T124432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3605db1a4895c0bab1d91b1656cb39c5bfede6a15b027eed28259d127a4f7542 shows in the end:

ok 7 - no (unexpected) warnings (via done_testing)
1..7

so all tests passed and "done_testing()" has been passed.

https://app.circleci.com/pipelines/github/os-autoinst/openQA/4328/workflows/2380a4fe-7ea3-4f46-b0ad-99f9aacb9e7e/jobs/41532/steps shows that t/api/09-comments.t failed
https://app.circleci.com/pipelines/github/os-autoinst/openQA/4329/workflows/01c4a5ce-86aa-4ba7-b77e-4a5a3d216c3d/jobs/41542 shows t/api/01-workers.t failed as well as t/api/02-iso.t .

Problem¶

tests abort even though everything else looks fine. We do not know what is wrong here. https://perldoc.perl.org/Test/More.html#EXIT-CODES
describes that Test::Builder itself would return 0 if all tests passed, 255 if the test aborted prematurely or a non-zero number corresponding to the number of failed tests if any. So this is most likely not the problem here.
Reading the source code of TAP::Formatter suggests that "Non-zero wait status: " has something to do with the exit code of other processes that we might wait for (implicitly or explicitly).

H1: could be that it only happens in circleci because we run with "Devel::Cover", compare to https://stackoverflow.com/questions/36639621/what-does-non-zero-wait-status-9-mean-when-running-test-using-develcover
H2: Something to do with processes in the background, maybe because they close stdout/stderr? See https://github.com/UPC/ravada/issues/194
H3: It's because of SIGALRM due to OpenQA::Test::TimeLimit and we somehow do not get the message that the test timed out , see man 7 signal. Maybe because subprocesses do not handle the alarm in the same way as the main process should do. But the tests already finished so this might still be about stuck background processes

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by okurz over 4 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz over 4 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz over 4 years ago

minimum reproducer:

#!/usr/bin/env perl
use Test::Most;
use FindBin;
use lib "$FindBin::Bin/../lib";
use OpenQA::Test::TimeLimit '2';
pass 'ok';

END {
    sleep 3; 
}

done_testing();

triggers:

$ prove t/api/01-workers.t 
t/api/01-workers.t .. All 1 subtests passed 

Test Summary Report
-------------------
t/api/01-workers.t (Wstat: 14 Tests: 1 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=1,  2 wallclock secs ( 0.03 usr  0.00 sys +  0.09 cusr  0.00 csys =  0.12 CPU)
Result: FAIL

same with do { 1 } while(1); so this is not about an incompatibility between "sleep" and "alarm" or so.

So my most probable hypothesis right now is that Devel::Cover is involved. Either it is stuck itself completely (unlikely) or it just takes much longer. In the end, the output that we see with "Non-zero wait status: 14" just means: "The test timed out after done_testing()" (within the 'END' block). So not pretty but still a clear situation. So what we could do is scale each timeout when running in CI and/or under Devel::Cover.

Actions

Copy link