Project

General

Profile

action #71845

tests fail in circleci with "Non-zero wait status: <non-negative number>" but "All <X> subtests passed"

Added by okurz 10 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-09-24
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

For example
https://app.circleci.com/pipelines/github/os-autoinst/openQA/4328/workflows/2380a4fe-7ea3-4f46-b0ad-99f9aacb9e7e/jobs/41532/steps
shows:

[08:43:42] t/api/09-comments.t ................      All 7 subtests passed 
[08:43:58] t/api/10-jobgroups.t ............... ok    17566 ms ( 0.17 usr  0.01 sys + 15.13 cusr  0.81 csys = 16.12 CPU)
[08:44:16] t/api/11-bugs.t .................... ok    14348 ms ( 0.03 usr  0.00 sys + 12.18 cusr  1.48 csys = 13.69 CPU)
[08:44:30] t/api/12-admin-workers.t ........... ok    14216 ms ( 0.00 usr  0.00 sys + 11.71 cusr  1.83 csys = 13.54 CPU)
[08:44:44] t/api/13-influxdb.t ................ ok    13538 ms ( 0.01 usr  0.00 sys + 11.65 cusr  1.26 csys = 12.92 CPU)
[08:44:58] t/api/14-plugin_obs_rsync_async.t .. ok    46127 ms ( 0.07 usr  0.00 sys + 16.40 cusr  2.93 csys = 19.40 CPU)
[08:45:44] t/api/14-plugin_obs_rsync.t ........ ok    97631 ms ( 0.10 usr  0.01 sys + 81.41 cusr 10.57 csys = 92.09 CPU)
[08:47:22] t/api/15-search.t .................. ok    13592 ms ( 0.02 usr  0.00 sys + 11.22 cusr  1.63 csys = 12.87 CPU)
[08:47:35] t/api/15-users.t ................... ok    12358 ms ( 0.01 usr  0.00 sys + 10.75 cusr  1.02 csys = 11.78 CPU)
[08:47:48]

Test Summary Report
-------------------
t/api/09-comments.t              (Wstat: 14 Tests: 7 Failed: 0)
  Non-zero wait status: 14

t/api/09-comments.t reported "All 7 subtests passed" and also the "tests" tab in circleCI does show that all tests passed and the artifact file, e.g. https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/58f7029dc9e77c000129905e/46416941/5f6c5695c60fc519f3c9f54a-0-build/artifacts/artifacts/api/09-comments.t?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200924T124432Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20200924%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3605db1a4895c0bab1d91b1656cb39c5bfede6a15b027eed28259d127a4f7542 shows in the end:

ok 7 - no (unexpected) warnings (via done_testing)
1..7

so all tests passed and "done_testing()" has been passed.

Problem

tests abort even though everything else looks fine. We do not know what is wrong here. https://perldoc.perl.org/Test/More.html#EXIT-CODES
describes that Test::Builder itself would return 0 if all tests passed, 255 if the test aborted prematurely or a non-zero number corresponding to the number of failed tests if any. So this is most likely not the problem here.
Reading the source code of TAP::Formatter suggests that "Non-zero wait status: " has something to do with the exit code of other processes that we might wait for (implicitly or explicitly).


Related issues

Related to openQA Project - action #71464: Show output of failed tests within circleci log view, like on local prove calls, not only "Tests" tabResolved2020-09-17

History

#1 Updated by okurz 10 months ago

  • Description updated (diff)

#2 Updated by okurz 10 months ago

  • Description updated (diff)

#3 Updated by okurz 10 months ago

minimum reproducer:

#!/usr/bin/env perl
use Test::Most;
use FindBin;
use lib "$FindBin::Bin/../lib";
use OpenQA::Test::TimeLimit '2';
pass 'ok';

END {
    sleep 3; 
}

done_testing();

triggers:

$ prove t/api/01-workers.t 
t/api/01-workers.t .. All 1 subtests passed 

Test Summary Report
-------------------
t/api/01-workers.t (Wstat: 14 Tests: 1 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=1,  2 wallclock secs ( 0.03 usr  0.00 sys +  0.09 cusr  0.00 csys =  0.12 CPU)
Result: FAIL

same with do { 1 } while(1); so this is not about an incompatibility between "sleep" and "alarm" or so.

So my most probable hypothesis right now is that Devel::Cover is involved. Either it is stuck itself completely (unlikely) or it just takes much longer. In the end, the output that we see with "Non-zero wait status: 14" just means: "The test timed out after done_testing()" (within the 'END' block). So not pretty but still a clear situation. So what we could do is scale each timeout when running in CI and/or under Devel::Cover.

#4 Updated by okurz 10 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

#6 Updated by okurz 10 months ago

  • Related to action #71464: Show output of failed tests within circleci log view, like on local prove calls, not only "Tests" tab added

#7 Updated by okurz 10 months ago

https://github.com/os-autoinst/openQA/pull/3419 merged. https://github.com/os-autoinst/openQA/pull/3420 to have lowered and more restrictive timeouts. I needed to cover OBS as well to be more forgiving.

#8 Updated by okurz 10 months ago

  • Status changed from In Progress to Feedback

#9 Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openQA/pull/3420 merged. I consider all done here.

Also available in: Atom PDF