action #71845
closedtests fail in circleci with "Non-zero wait status: <non-negative number>" but "All <X> subtests passed"
Description
Observation¶
For example
https://app.circleci.com/pipelines/github/os-autoinst/openQA/4328/workflows/2380a4fe-7ea3-4f46-b0ad-99f9aacb9e7e/jobs/41532/steps
shows:
[08:43:42] t/api/09-comments.t ................ All 7 subtests passed
[08:43:58] t/api/10-jobgroups.t ............... ok 17566 ms ( 0.17 usr 0.01 sys + 15.13 cusr 0.81 csys = 16.12 CPU)
[08:44:16] t/api/11-bugs.t .................... ok 14348 ms ( 0.03 usr 0.00 sys + 12.18 cusr 1.48 csys = 13.69 CPU)
[08:44:30] t/api/12-admin-workers.t ........... ok 14216 ms ( 0.00 usr 0.00 sys + 11.71 cusr 1.83 csys = 13.54 CPU)
[08:44:44] t/api/13-influxdb.t ................ ok 13538 ms ( 0.01 usr 0.00 sys + 11.65 cusr 1.26 csys = 12.92 CPU)
[08:44:58] t/api/14-plugin_obs_rsync_async.t .. ok 46127 ms ( 0.07 usr 0.00 sys + 16.40 cusr 2.93 csys = 19.40 CPU)
[08:45:44] t/api/14-plugin_obs_rsync.t ........ ok 97631 ms ( 0.10 usr 0.01 sys + 81.41 cusr 10.57 csys = 92.09 CPU)
[08:47:22] t/api/15-search.t .................. ok 13592 ms ( 0.02 usr 0.00 sys + 11.22 cusr 1.63 csys = 12.87 CPU)
[08:47:35] t/api/15-users.t ................... ok 12358 ms ( 0.01 usr 0.00 sys + 10.75 cusr 1.02 csys = 11.78 CPU)
[08:47:48]
Test Summary Report
-------------------
t/api/09-comments.t (Wstat: 14 Tests: 7 Failed: 0)
Non-zero wait status: 14
t/api/09-comments.t reported "All 7 subtests passed" and also the "tests" tab in circleCI does show that all tests passed and the artifact file, e.g. https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6c5695c60fc519f3c9f54a-0-build/artifacts/artifacts/api/09-comments.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T124432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3605db1a4895c0bab1d91b1656cb39c5bfede6a15b027eed28259d127a4f7542 shows in the end:
ok 7 - no (unexpected) warnings (via done_testing)
1..7
so all tests passed and "done_testing()" has been passed.
- https://app.circleci.com/pipelines/github/os-autoinst/openQA/4328/workflows/2380a4fe-7ea3-4f46-b0ad-99f9aacb9e7e/jobs/41532/steps shows that t/api/09-comments.t failed
- https://app.circleci.com/pipelines/github/os-autoinst/openQA/4329/workflows/01c4a5ce-86aa-4ba7-b77e-4a5a3d216c3d/jobs/41542 shows t/api/01-workers.t failed as well as t/api/02-iso.t .
Problem¶
tests abort even though everything else looks fine. We do not know what is wrong here. https://perldoc.perl.org/Test/More.html#EXIT-CODES
describes that Test::Builder itself would return 0 if all tests passed, 255 if the test aborted prematurely or a non-zero number corresponding to the number of failed tests if any. So this is most likely not the problem here.
Reading the source code of TAP::Formatter suggests that "Non-zero wait status: " has something to do with the exit code of other processes that we might wait for (implicitly or explicitly).
- H1: could be that it only happens in circleci because we run with "Devel::Cover", compare to https://stackoverflow.com/questions/36639621/what-does-non-zero-wait-status-9-mean-when-running-test-using-develcover
- H2: Something to do with processes in the background, maybe because they close stdout/stderr? See https://github.com/UPC/ravada/issues/194
- H3: It's because of SIGALRM due to OpenQA::Test::TimeLimit and we somehow do not get the message that the test timed out , see
man 7 signal
. Maybe because subprocesses do not handle the alarm in the same way as the main process should do. But the tests already finished so this might still be about stuck background processes
Updated by okurz about 4 years ago
minimum reproducer:
#!/usr/bin/env perl
use Test::Most;
use FindBin;
use lib "$FindBin::Bin/../lib";
use OpenQA::Test::TimeLimit '2';
pass 'ok';
END {
sleep 3;
}
done_testing();
triggers:
$ prove t/api/01-workers.t
t/api/01-workers.t .. All 1 subtests passed
Test Summary Report
-------------------
t/api/01-workers.t (Wstat: 14 Tests: 1 Failed: 0)
Non-zero wait status: 14
Files=1, Tests=1, 2 wallclock secs ( 0.03 usr 0.00 sys + 0.09 cusr 0.00 csys = 0.12 CPU)
Result: FAIL
same with do { 1 } while(1);
so this is not about an incompatibility between "sleep" and "alarm" or so.
So my most probable hypothesis right now is that Devel::Cover is involved. Either it is stuck itself completely (unlikely) or it just takes much longer. In the end, the output that we see with "Non-zero wait status: 14" just means: "The test timed out after done_testing()" (within the 'END' block). So not pretty but still a clear situation. So what we could do is scale each timeout when running in CI and/or under Devel::Cover.
Updated by okurz about 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz about 4 years ago
Updated by okurz about 4 years ago
- Related to action #71464: Show output of failed tests within circleci log view, like on local prove calls, not only "Tests" tab added
Updated by okurz about 4 years ago
https://github.com/os-autoinst/openQA/pull/3419 merged. https://github.com/os-autoinst/openQA/pull/3420 to have lowered and more restrictive timeouts. I needed to cover OBS as well to be more forgiving.
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/openQA/pull/3420 merged. I consider all done here.