Project

General

Profile

action #103422

[sporadic] os-autoinst: 13-osutils.t:167 Failed test 'Exit code appear in log' in GHA size:M

Added by cdywan about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-11-30
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

First seen on https://github.com/os-autoinst/os-autoinst/pull/1871#issuecomment-984670874:

8: [11:23:53] ./12-bmwqemu.t ........................... ok      914 ms ( 0.01 usr  0.00 sys +  0.82 cusr  0.09 csys =  0.92 CPU)
8: 
8:     #   Failed test 'Exit code appear in log'
8:     #   at ./13-osutils.t line 167.
8:     # STDERR:
8:     # [2021-12-02T11:23:54.401579Z] [debug] running `echo foo`
8:     # [2021-12-02T11:23:54.405060Z] [debug] Command `echo foo` terminated with 1
8:     #   foo
8:     # 
8:     # doesn't match:
8:     # (?^u:terminated with 0)
8:     # as expected
8:     # Looks like you failed 1 test of 10.
8: 
8: #   Failed test 'run_diag'
8: #   at ./13-osutils.t line 190.
8: # Looks like you failed 1 test of 9.
8: [11:23:54] ./13-osutils.t ........................... 
8: Dubious, test returned 1 (wstat 256, 0x100)
8: Failed 1/9 subtests 

Another failure in the same test, maybe related:
https://github.com/perlpunk/os-autoinst/runs/4408870749?check_suite_focus=true

8: runcmd 'rm image.qcow2' failed with exit code 1 at /opt/osutils.pm line 112.
8: # Tests were run but no plan was declared and done_testing() was not seen.
8: # Looks like your test exited with 255 just after 7.
8: [14:05:37] ./13-osutils.t ........................... 
8: Dubious, test returned 255 (wstat 65280, 0xff00)
8: All 7 subtests passed 

More occurrences:

I couldn't find another instance. What I did find was another failure that didn't fail CI in 14-isotovideo.t:128.

Acceptance criteria

  • AC1: Failure in 13-osutils.t fails CI
  • AC2: 13-osutils.t passes w/o failures

Suggestions

  • Confirm when 13-osutils.t started failing
  • Fix 13-osutils.t
  • Compare if 14-isotovideo.t and maybe other tests generally don't fail CI
  • Ensure that the pipeline fails if any tests fail

Related issues

Has duplicate openQA Project - action #103485: [sporadic] os-autoinst: flaky test t/13-osutils.tRejected2021-12-05

Copied to openQA Project - action #103467: [sporadic] os-autoinst: 14-isotovideo.t:133 'no fatal error recorded' size:MResolved2021-11-30

Copied to openQA Project - action #103611: test failure: os-autoinst: 29-backend-driver.t:31 Failed test 'exit logged'Resolved2021-11-30

Copied to openQA Project - action #103692: Only upgrade o3 workers if package checks are good, same as for o3 webui size:MResolved2021-12-25

History

#1 Updated by okurz about 2 months ago

  • Subject changed from os-autoinst: 13-osutils.t:167 Failed test 'Exit code appear in log' in GHA to [sporadic] os-autoinst: 13-osutils.t:167 Failed test 'Exit code appear in log' in GHA

So I understand it's sporadic?

#2 Updated by cdywan about 2 months ago

cdywan wrote:

Suggestions

  • Confirm when 13-osutils.t started failing
  • Compare if 14-isotovideo.t and maybe other tests generally don't fail CI

btw I put this as a suggestion here rather than an additional ticket, since it's probably not feasible to ensure tests fail without fixing them first; of course we can re-consider this when we estimate it

#3 Updated by okurz about 2 months ago

  • Subject changed from [sporadic] os-autoinst: 13-osutils.t:167 Failed test 'Exit code appear in log' in GHA to [sporadic] os-autoinst: 13-osutils.t:167 Failed test 'Exit code appear in log' in GHA size:M
  • Description updated (diff)
  • Status changed from New to Workable

There is also

8: Error writing to virtio/svirt serial terminal: Broken pipe at ./10-terminal.t line 233.
8: 
8: #   Failed test 'Fake terminal process exit status is 0'
8: #   at ./10-terminal.t line 318.
8: #          got: '1'
8: #     expected: '0'
8: 
8: #   Failed test 'Direct test VIRTIO_CONSOLE not set process exit status is 0'
8: #   at ./10-terminal.t line 318.
8: #          got: '32'
8: #     expected: '0'
8: 
8: #   Failed test '[Child 2321] fake_terminal timed out while performing IO'
8: #   at ./10-terminal.t line 365.
8: 
8: #   Failed test '[Child 2322] direct: find login prompt'
8: #   at ./10-terminal.t line 365.
8: #     Structures begin differing at:
8: #          $got->{matched} = '0'
8: #     $expected->{matched} = '1'
8: # Looks like you failed 4 tests of 11.
8: [20:34:52] ./10-terminal.t .......................... 
8: Dubious, test returned 4 (wstat 1024, 0x400)
8: Failed 4/11 subtests 

showing up in https://github.com/os-autoinst/os-autoinst/runs/4386959135?check_suite_focus=true but the pipeline job shows passed, see the green checkmark on https://github.com/os-autoinst/os-autoinst/pull/1789 where we would expect a red cross

#4 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

The issue of 13-osutils.t might be fixed by https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/pull/22 (released as RWP version 0.31).

If another failing test is ignored as well I suppose the problem isn't really test-specific, though.

#5 Updated by tinita about 2 months ago

This (failures not leading to CI failure) has been going on for several weeks, and my check-for-git-status feature seems to be the culprit :(

Still trying to reproduce locally
https://github.com/perlpunk/os-autoinst/commit/9ef54a85a901a3a2340da53a8d0699de782322a3

edit: it happens only in combination with coverage:

CI=1 WITH_COVER_OPTIONS=1 CHECK_GIT_STATUS=1 ./tools/invoke-tests

now if I make a test fail, then the exit code of the first prove call is 0 and it will continue

#6 Updated by tinita about 2 months ago

Fix for the unrecognized test failures: https://github.com/os-autoinst/os-autoinst/pull/1873 (merged)

#7 Updated by tinita about 2 months ago

Quite embarrassing, but the previous PR (merged already) had a mistake: https://github.com/os-autoinst/os-autoinst/pull/1874 (merged)

#8 Updated by mkittler about 2 months ago

  • Assignee changed from mkittler to tinita

Ok, you're writing comments faster than I can update the ticket with my own finding. I've cross-checked the CMake code and bash syntax and unless #!/bin/bash -e is not effective it must be the prove invocation itself which doesn't return the non-zero exit code correctly. So your PR likely fixes the issue, indeed. I've approved it and I'm assigning you to the ticket.

#9 Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress

#10 Updated by tinita about 2 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

#11 Updated by tinita about 2 months ago

  • Copied to action #103467: [sporadic] os-autoinst: 14-isotovideo.t:133 'no fatal error recorded' size:M added

#12 Updated by tinita about 2 months ago

  • Description updated (diff)

#13 Updated by okurz about 2 months ago

  • Has duplicate action #103485: [sporadic] os-autoinst: flaky test t/13-osutils.t added

#14 Updated by okurz about 2 months ago

Observed in GHA and
locally

## Run 18
t/13-osutils.t .. 1/?
    #   Failed test 'Exit code appear in log'
    #   at t/13-osutils.t line 167.
    # STDERR:
    # [2021-12-05T11:03:37.186582+01:00] [debug] running `echo foo`
    # [2021-12-05T11:03:37.194499+01:00] [debug] Command `echo foo` terminated with 1
    #   foo
    # 
    # doesn't match:
    # (?^u:terminated with 0)
    # as expected
    # Looks like you failed 1 test of 10.

#   Failed test 'run_diag'
#   at t/13-osutils.t line 190.
# Looks like you failed 1 test of 9.
t/13-osutils.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/9 subtests 

Reproducible

Locally I could only reproduce it in 1/500 (!) runs with runs=500 count_fail_ratio prove -I. t/13-osutils.t

runs=500 count_fail_ratio make test-perl-testsuite WITH_COVER_OPTIONS=1 TESTS=t/13-osutils.t yields

## count_fail_ratio: Run: 500. Fails: 5. Fail ratio 1%

#15 Updated by tinita about 2 months ago

  • Status changed from Feedback to In Progress

#16 Updated by tinita about 2 months ago

I looked into all master CI runs since
"Merge pull request #1825 from perlpunk/check-for-git-status ci #1005" (Oct 28)
https://github.com/os-autoinst/os-autoinst/runs/4035778918?check_suite_focus=true

It started to fail for this one:
"Merge pull request #1855 from cfconrad/chattr_debugging ci #1145" (Nov 23)
https://github.com/os-autoinst/os-autoinst/runs/4298150735?check_suite_focus=true

For the record, all master runs:

1222: pass https://github.com/os-autoinst/os-autoinst/runs/4396177333?check_suite_focus=true
1216: pass https://github.com/os-autoinst/os-autoinst/runs/4383814133?check_suite_focus=true
1210: pass https://github.com/os-autoinst/os-autoinst/runs/4379650280?check_suite_focus=true

1204: 13-osutils.t Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4370300894?check_suite_focus=true

1198: pass (*) https://github.com/os-autoinst/os-autoinst/runs/4367441256?check_suite_focus=true
1194: pass https://github.com/os-autoinst/os-autoinst/runs/4340445436?check_suite_focus=true
1192: pass https://github.com/os-autoinst/os-autoinst/runs/4332853295?check_suite_focus=true
1190: pass https://github.com/os-autoinst/os-autoinst/runs/4332545501?check_suite_focus=true
1188: pass https://github.com/os-autoinst/os-autoinst/runs/4331514911?check_suite_focus=true
1184: pass https://github.com/os-autoinst/os-autoinst/runs/4325174341?check_suite_focus=true

1180: 13-osutils.t Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4324359196?check_suite_focus=true

1178: 13-osutils.t Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4323176520?check_suite_focus=true

1174: 13-osutils.t  Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4318559803?check_suite_focus=true

1150: 13-osutils.t  Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4300097983?check_suite_focus=true

1145: 13-osutils.t Failed test 'Exit code appear in log'
https://github.com/os-autoinst/os-autoinst/runs/4298150735?check_suite_focus=true

1144: pass https://github.com/os-autoinst/os-autoinst/runs/4297730291?check_suite_focus=true
1143: pass https://github.com/os-autoinst/os-autoinst/runs/4297702670?check_suite_focus=true
1130: pass https://github.com/os-autoinst/os-autoinst/runs/4287579737?check_suite_focus=true
1099: pass https://github.com/os-autoinst/os-autoinst/runs/4210349893?check_suite_focus=true
1098: pass https://github.com/os-autoinst/os-autoinst/runs/4209742250?check_suite_focus=true
1082: pass https://github.com/os-autoinst/os-autoinst/runs/4165749243?check_suite_focus=true
1077: pass https://github.com/os-autoinst/os-autoinst/runs/4154751608?check_suite_focus=true
1073: pass https://github.com/os-autoinst/os-autoinst/runs/4154049211?check_suite_focus=true
1072: pass https://github.com/os-autoinst/os-autoinst/runs/4153523636?check_suite_focus=true
1068: pass https://github.com/os-autoinst/os-autoinst/runs/4153070582?check_suite_focus=true
1067: pass https://github.com/os-autoinst/os-autoinst/runs/4153068569?check_suite_focus=true
1059: pass https://github.com/os-autoinst/os-autoinst/runs/4151045057?check_suite_focus=true
1053: pass https://github.com/os-autoinst/os-autoinst/runs/4139738683?check_suite_focus=true
1050: pass https://github.com/os-autoinst/os-autoinst/runs/4139609023?check_suite_focus=true
1026: pass https://github.com/os-autoinst/os-autoinst/runs/4082501881?check_suite_focus=true
1024: pass https://github.com/os-autoinst/os-autoinst/runs/4079870003?check_suite_focus=true
1023: pass https://github.com/os-autoinst/os-autoinst/runs/4079327360?check_suite_focus=true
1016: pass https://github.com/os-autoinst/os-autoinst/runs/4049541911?check_suite_focus=true
1013: pass https://github.com/os-autoinst/os-autoinst/runs/4049078290?check_suite_focus=true
1010: pass https://github.com/os-autoinst/os-autoinst/runs/4041083982?check_suite_focus=true
1005: pass https://github.com/os-autoinst/os-autoinst/runs/4035778918?check_suite_focus=true

#17 Updated by tinita about 2 months ago

Commented here: https://github.com/os-autoinst/os-autoinst/pull/1855#issuecomment-986742654
Looks like it could be related as the test calls run_diag which was changed.

#18 Updated by tinita about 2 months ago

  • Status changed from In Progress to Feedback

Clemens confirmed Marius' suggestion that https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/pull/22
will fix this.
The version 0.31 is currently submitted to Factory: https://build.opensuse.org/request/show/935940

#19 Updated by okurz about 2 months ago

As a temporary workaround I did osc linkpac devel:languages:perl perl-Mojo-IOLoop-ReadWriteProcess devel:openQA:Leap:15.3 as multiple PRs are impacted. Let's see if this helps. After the fix reached Factory we should update the link to point to the Factory version again.

#20 Updated by tinita about 2 months ago

All 15.3 builds are failing now with the same error

perl-Mojo-IOLoop-ReadWriteProcess-0.31-lp153.30.1

https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_15.3/x86_64

[  211s] 3: [17:11:16] ./28-signalblocker.t ..................... ok      686 ms ( 0.00 usr  0.00 sys +  0.56 cusr  0.14 csys =  0.70 CPU)
[  212s] 3: 
[  212s] 3: #   Failed test 'exit logged'
[  212s] 3: #   at ./29-backend-driver.t line 31.
[  212s] 3: # STDOUT & STDERR:
[  212s] 3: # [37m[2021-12-06T17:11:16.953486Z] [debug] received magic close
[  212s] 3: # [0m
[  212s] 3: # don't match:
[  212s] 3: # (?^u:backend.*exited)
[  212s] 3: # as expected
[  212s] 3: [37m[2021-12-06T17:11:17.057808Z] [debug] backend process exited: 72057594037927935
[  212s] 3: [0m# Looks like you failed 1 test of 12.
[  212s] 3: [17:11:17] ./29-backend-driver.t .................... 
[  212s] 3: Dubious, test returned 1 (wstat 256, 0x100)
[  212s] 3: Failed 1/12 subtests 

Also the output of 10-virtio_terminal.t looks weird, but not sure if this was happening before.
It still succeeds, though:

[  196s] 3: Can't open out pipe for reading Interrupted system call at ./10-virtio_terminal.t line 44.
[  196s] 3: 
[  196s] 3:     #   Failed test 'no (unexpected) warnings (via END block)'
[  196s] 3:     #   at /usr/lib/perl5/5.26.1/Test/Builder.pm line 135.
[  196s] 3:     # Got the following unexpected warnings:
[  196s] 3:     #   1: Can't open out pipe for reading Interrupted system call at ./10-virtio_terminal.t line 44.
[  196s] 3: [33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[17:11:01] ./10-virtio_terminal.t ................... ok      820 ms ( 0.01 usr  0.00 sys +  0.71 cusr  0.13 csys =  0.85 CPU)

This is how it looks like on 15.2:

[  164s] 3: [33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[33m[0m[10:16:34] ./10-virtio_terminal.t ................... ok      835 ms ( 0.01 usr  0.00 sys +  0.71 cusr  0.14 csys =  0.86 CPU)

In any case we should get rid of the ansi codes.

Created an issue here: https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/issues/29

#21 Updated by okurz about 2 months ago

are you sure it's an upstream issue and not just what we expect in our tests and that should be made less strict?

Also, as perl-Mojo-IOLoop-ReadWriteProcess was accepted into openSUSE:Factory I now did osc linkpac -f openSUSE:Factory perl-Mojo-IOLoop-ReadWriteProcess devel:openQA:Leap:15.3

EDIT: Hm, I ensured that I have perl-Mojo-IOLoop-ReadWriteProcess-0.31-lp153.30.1 installed locally as well but I can only easily reproduce the 29-backend-driver.t problem, not easily the others of the above errors. So for me it's not happening or as sporadic as reported in before

EDIT: I know the realized the message "backend process exited: 72057594037927935". That looks certainly very wrong.

#22 Updated by okurz about 2 months ago

  • Priority changed from High to Urgent

The overall problem becomes more severe with perl-Mojo-IOLoop-ReadWriteProcess in Tumbleweed as well as in devel:openQA:Leap:15.3 and also blocking more pull requests, see e.g. https://github.com/os-autoinst/openQA/pull/4388 . Feel free to pull in more people to help or delegate.

#23 Updated by tinita about 2 months ago

  • Copied to action #103611: test failure: os-autoinst: 29-backend-driver.t:31 Failed test 'exit logged' added

#24 Updated by tinita about 2 months ago

I copied the new issue with 29-backend-driver.t et al. over to #103611

I would say the issue in this ticket with 13-osutils.t is probably solved.

#25 Updated by cdywan about 2 months ago

tinita wrote:

I copied the new issue with 29-backend-driver.t et al. over to #103611

I would say the issue in this ticket with 13-osutils.t is probably solved.

Would you mind running runs=500 count_fail_ratio make test-perl-testsuite WITH_COVER_OPTIONS=1 TESTS=t/13-osutils.t before resolving?

#26 Updated by okurz about 2 months ago

  • Copied to action #103692: Only upgrade o3 workers if package checks are good, same as for o3 webui size:M added

#27 Updated by okurz about 2 months ago

As we discussed this topic and tried to find improvements I created a ticket #103692 to improve

#28 Updated by tinita about 2 months ago

  • Status changed from Feedback to Blocked

For now this is blocked because we have the old RWP version, so tests are sporadically failing:
https://github.com/os-autoinst/os-autoinst/actions/runs/1559333666
https://github.com/os-autoinst/os-autoinst/actions/runs/1559285099

Waiting until RWP 0.32 is released and in Factory.

#29 Updated by okurz about 1 month ago

  • Status changed from Blocked to Resolved

https://build.opensuse.org/package/show/openSUSE:Factory/perl-Mojo-IOLoop-ReadWriteProcess is 0.32 now, tests in os-autoinst pass and any test failures would make the CI pipeline fail so both ACs covered and followup work defined in #103845

Also available in: Atom PDF