action #92164 (closed)

t/01-test-utilities.t fails in 'test would have failed' in unrelated PR

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-05-05
Due date: 2021-05-19
% Done: 0%
Estimated time:

Description

Observation

In https://github.com/os-autoinst/openQA/pull/3877, in the CircleCI job https://circleci.com/gh/os-autoinst/openQA/60151?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link, t/01-test-utilities.t fails with:

not ok 1 - warnings in sub processes are fatal test failures
Subtest: warnings in sub processes are fatal test failures
    # PID of test-process-1: 141
    # waiting at most 42 seconds for SIGCHLD (sleep is supposed to be interrupted by SIGCHLD)
    ok 1 - warning logged
    not ok 2 - test would have failed
    ok 3 - message of test failure
    ok 4 - exit code of test failure is non-zero
    not ok 5 - manual termination via stop_service does not trigger _fail_and_exit
    1..5
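
For context, here is a minimal sketch of the mechanism the subtest relies on (the names and structure below are assumed for illustration, not copied from t/01-test-utilities.t or OpenQA::Test::Utils): a sub-process treats warnings as fatal test failures and exits, and the parent expects the resulting SIGCHLD to interrupt its sleep well before the 42-second timeout.

    use strict;
    use warnings;

    my $child_exited = 0;
    $SIG{CHLD} = sub { $child_exited = 1 };    # parent wants SIGCHLD to cut the sleep short

    my $pid = fork // die "fork failed: $!";
    if ($pid == 0) {
        # sub-process: any warning is treated as a fatal test failure and terminates it
        local $SIG{__WARN__} = sub { print STDERR "fatal warning: $_[0]"; exit 1 };
        warn "unexpected warning\n";
        exit 0;    # never reached
    }

    # Note: if the child exits before the SIGCHLD handler is in place (the ordering
    # problem found later in this ticket) or before this sleep starts, nothing
    # interrupts the sleep and it runs the full 42 seconds.
    my $slept = sleep 42;
    waitpid $pid, 0;
    printf "slept %ds, child exit code %d, SIGCHLD seen: %d\n",
      $slept, $? >> 8, $child_exited;
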
Actions #1

Updated by okurz almost 3 years ago

In the local environment on my work notebook 1k runs (!) have succeeded, but in CircleCI a second run failed: https://app.circleci.com/pipelines/github/os-autoinst/openQA/6386/workflows/ba9d9a66-05f7-4c14-b2ea-921dd5a2e272/jobs/60180/steps . Does anyone have an idea?

Actions #2

Updated by okurz almost 3 years ago

mkittler mentioned it might be because of coverage analysis. Trying with

runs=2000 count_fail_ratio make coverage KEEP_DB=1 TESTS=t/01-test-utilities.t

but that fails for me with

Can't read /home/okurz/local/os-autoinst/openQA/cover_db/runs/1620218733.13161.31806/cover.14 with Sereal: Sereal: Error: Bad Sereal header: Not a valid Sereal document. at offset 1 of input at srl_decoder.c line 600 at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Devel/Cover/DB/IO/Sereal.pm line 34, <$fh> chunk 1.

I think I had something like this already in the past. What's missing now?

Actions #3

Updated by okurz almost 3 years ago

Using https://github.com/os-autoinst/openQA/pull/3885 as a workaround I can proceed. Now running tests with coverage in a loop locally …

Actions #4

Updated by okurz almost 3 years ago

  • Due date set to 2021-05-19
  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

The experiment finished; runs=2000 count_fail_ratio make coverage KEEP_DB=1 TESTS=t/01-test-utilities.t yields:

## count_fail_ratio: Run: 2000. Fails: 0. Fail ratio 0%

so at least it is not easily reproducible this way with coverage locally; it must be something else. Any other ideas what to try?
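
For reference, count_fail_ratio just runs the given command the requested number of times and reports how many runs failed. Below is a rough sketch of that idea in Perl; it is not the actual helper, and the default command is only an example.

    use strict;
    use warnings;

    # Run a command N times and report the fraction of failing runs,
    # similar in spirit to the summary line count_fail_ratio prints.
    my $runs  = $ENV{runs} // 20;
    my @cmd   = @ARGV ? @ARGV : ('prove', '-l', 't/01-test-utilities.t');
    my $fails = 0;

    for my $i (1 .. $runs) {
        system(@cmd);
        ++$fails if $? != 0;
    }
    printf "## count_fail_ratio: Run: %d. Fails: %d. Fail ratio %d%%\n",
      $runs, $fails, $runs ? 100 * $fails / $runs : 0;
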

Actions #5

Updated by tinita almost 3 years ago

Maybe not a new observation, but at least I wanted to note it:
I ran the tests on CircleCI in a loop (with a dummy fail and RETRY), and in the failing cases the output

ok 1 - warning logged
...

always appeared around 40 seconds after the note

# waiting at most 42 seconds for SIGCHLD (sleep is supposed to be interrupted by SIGCHLD)

So at least we know that SIGCHLD is not interrupting the sleep in this case.

Full log of 20 runs:
https://app.circleci.com/pipelines/github/perlpunk/openQA/167/workflows/a5c11032-e6b3-4bf4-a548-70a029d0f4c0/jobs/743

Actions #6

Updated by tinita almost 3 years ago

I added more logging, and it appears that the SIGWARN handler set up in OpenQA::Test::Utils::_setup_sub_process is simply called before the SIGCHLD handler has been installed.
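
A minimal stand-alone illustration of that ordering problem (not the openQA code; everything below is simplified for demonstration): when the sub-process warns and exits before the parent has installed its SIGCHLD handler, the signal hits the default disposition and is discarded, so the later sleep runs its full length, matching the roughly 40-second delays seen in the failing runs.

    use strict;
    use warnings;

    my $pid = fork // die "fork failed: $!";
    if ($pid == 0) {
        # sub-process: the warning is fatal, so it exits almost immediately
        # and SIGCHLD is sent to the parent right away
        local $SIG{__WARN__} = sub { exit 1 };
        warn "unexpected warning\n";
        exit 0;    # never reached
    }

    sleep 1;    # simulate the parent being slow: SIGCHLD arrives while the
                # default disposition is still in effect and is simply discarded

    my $interrupted = 0;
    $SIG{CHLD} = sub { $interrupted = 1 };    # handler installed too late

    my $slept = sleep 5;    # nothing interrupts this sleep; it runs the full 5 seconds
    waitpid $pid, 0;
    printf "interrupted: %d, slept: %ds\n", $interrupted, $slept;
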

Actions #7

Updated by tinita almost 3 years ago

  • Assignee changed from okurz to tinita
Actions #8

Updated by okurz almost 3 years ago

This is a great finding! I just hope we find a better way than a sleep 1, which I despise :D
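
For what it's worth, one standard race-free pattern that avoids signal timing altogether (just a sketch, assuming the parent only needs to notice the sub-process exiting within a timeout; not necessarily what was or should be done in openQA): let the parent block on the read end of a pipe that the kernel closes when the sub-process exits.

    use strict;
    use warnings;
    use IO::Select;

    # The parent waits on the read end of a pipe instead of on SIGCHLD: when the
    # sub-process exits, its copy of the write end is closed automatically and the
    # read end becomes readable (EOF), so no handler ordering matters.
    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork // die "fork failed: $!";
    if ($pid == 0) {
        close $reader;
        local $SIG{__WARN__} = sub { exit 1 };    # warning is fatal in the sub-process
        warn "unexpected warning\n";
        exit 0;                                   # never reached
    }

    close $writer;                                # parent keeps only the read end
    my @ready = IO::Select->new($reader)->can_read(42);    # returns once the child is gone
    waitpid $pid, 0;
    printf "child exit code: %d, noticed within the timeout: %s\n",
      $? >> 8, @ready ? 'yes' : 'no';
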

Actions #9

Updated by tinita almost 3 years ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by okurz almost 3 years ago

We assume that the last failure was in an older branch and that the problem is resolved; otherwise we can reopen or follow up in a new ticket.
