action #92164
closedt/01-test-utilities.t fails in 'test would have failed' in unrelated PR
Description
Observation¶
In https://github.com/os-autoinst/openQA/pull/3877
in https://circleci.com/gh/os-autoinst/openQA/60151?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link t/01-test-utilities.t fails with
not ok 1 - warnings in sub processes are fatal test failures
Subtest: warnings in sub processes are fatal test failures
# PID of test-process-1: 141
# waiting at most 42 seconds for SIGCHLD (sleep is supposed to be interrupted by SIGCHLD)
ok 1 - warning logged
not ok 2 - test would have failed
ok 3 - message of test failure
ok 4 - exit code of test failure is non-zero
not ok 5 - manual termination via stop_service does not trigger _fail_and_exit
1..5
Updated by okurz over 3 years ago
On my work notebook local environment 1k runs (!) have succeeded, in circleCI a second run failed: https://app.circleci.com/pipelines/github/os-autoinst/openQA/6386/workflows/ba9d9a66-05f7-4c14-b2ea-921dd5a2e272/jobs/60180/steps . Anyone an idea?
Updated by okurz over 3 years ago
mkittler mentioned maybe it's because of coverage analysis. Trying with
runs=2000 count_fail_ratio make coverage KEEP_DB=1 TESTS=t/01-test-utilities.t
but that fails for me with Can't read /home/okurz/local/os-autoinst/openQA/cover_db/runs/1620218733.13161.31806/cover.14 with Sereal: Sereal: Error: Bad Sereal header: Not a valid Sereal document. at offset 1 of input at srl_decoder.c line 600 at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Devel/Cover/DB/IO/Sereal.pm line 34, <$fh> chunk 1.
. I think I had something like this already in the past. What's missing now?
Updated by okurz over 3 years ago
Using https://github.com/os-autoinst/openQA/pull/3885 as workaround I can proceed. Running tests with coverage now in a loop locally …
Updated by okurz over 3 years ago
- Due date set to 2021-05-19
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
experiment finished, runs=2000 count_fail_ratio make coverage KEEP_DB=1 TESTS=t/01-test-utilities.t
yields:
## count_fail_ratio: Run: 2000. Fails: 0. Fail ratio 0%
so at least not easily reproducible this way with coverage locally, must be something else. Any other ideas what to try?
Updated by tinita over 3 years ago
Maybe not a new observation, but at least I wanted to note it:
I ran the tests on CircleCI in a loop (with a dummy fail and RETRY), and in the failing cases the output
ok 1 - warning logged
...
always appeared around 40 seconds after the note
# waiting at most 42 seconds for SIGCHLD (sleep is supposed to be interrupted by SIGCHLD)
So at least we know that the SIGCHLD is not interrupting the sleep in this case.
Full log of 20 runs:
https://app.circleci.com/pipelines/github/perlpunk/openQA/167/workflows/a5c11032-e6b3-4bf4-a548-70a029d0f4c0/jobs/743
Updated by tinita over 3 years ago
I added more logging, and it appears that the SIGWARN handler setup in OpenQA::Test::Utils::_setup_sub_process
is simply called before the SIGCHLD handler has been installed.
Updated by tinita over 3 years ago
- Assignee changed from okurz to tinita
Updated by okurz over 3 years ago
This is a great finding! I just hope we find a better way than sleep 1 which I despise :D
Updated by mkittler over 3 years ago
Not sure whether it helped: https://github.com/os-autoinst/openQA/pull/3877#issuecomment-836523675
Updated by okurz over 3 years ago
We assume that the last failure was in an older branch. We assume the problem is resolved, otherwise we can reopen or try in a new ticket.