Project

General

Profile

action #65082

flaky/sporadic unstable failure in t/43-scheduling-and-worker-scalability.t

Added by okurz 10 months ago. Updated 6 months ago.

Status:
Workable
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2020-03-31
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Seen in https://circleci.com/api/v1.1/project/github/os-autoinst/openQA/19175/output/111/0?file=true&allocation-id=5e83023bf2a75b19962e1646-0-build%2F6148E966 but also in at least one other case:

[08:55:42] t/43-scheduling-and-worker-scalability.t .. 1/? 
    #   Failed test 'all jobs done'
    #   at t/43-scheduling-and-worker-scalability.t line 196.
    #          got: '4'
    #     expected: '5'

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 197.
    #          got: '4'
    #     expected: '5'
    # Looks like you failed 2 tests of 3.

[08:55:42] t/43-scheduling-and-worker-scalability.t .. 2/? 
#   Failed test 'assign and run jobs'
#   at t/43-scheduling-and-worker-scalability.t line 199.

[08:55:42] t/43-scheduling-and-worker-scalability.t .. 3/? # Looks like you failed 1 test of 4.


[08:55:42] t/43-scheduling-and-worker-scalability.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests 
[08:56:13]

Test Summary Report
-------------------
t/43-scheduling-and-worker-scalability.t (Wstat: 256 Tests: 4 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
Files=2, Tests=9, 88.8589 wallclock secs ( 0.59 usr  0.04 sys + 183.69 cusr 28.88 csys = 213.20 CPU)
Result: FAIL
Retrying...
[08:56:13] t/05-scheduler-full.t ..................... 
[08:56:13] t/05-scheduler-full.t ..................... 1/? 
[08:56:13] t/05-scheduler-full.t ..................... 2/? 
[08:56:13] t/05-scheduler-full.t ..................... 3/? 
[08:56:13] t/05-scheduler-full.t ..................... 4/? 
[08:56:13] t/05-scheduler-full.t ..................... 5/? 

[08:56:13] t/05-scheduler-full.t ..................... ok    53057 ms ( 0.43 usr  0.03 sys + 146.90 cusr 12.32 csys = 159.68 CPU)
[08:57:06] t/43-scheduling-and-worker-scalability.t .. 
[08:57:06] t/43-scheduling-and-worker-scalability.t .. 1/? 
    #   Failed test 'all jobs done'
    #   at t/43-scheduling-and-worker-scalability.t line 196.
    #          got: '4'
    #     expected: '5'

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 197.
    #          got: '4'
    #     expected: '5'
    # Looks like you failed 2 tests of 3.

[08:57:06] t/43-scheduling-and-worker-scalability.t .. 2/? 
#   Failed test 'assign and run jobs'
#   at t/43-scheduling-and-worker-scalability.t line 199.

[08:57:06] t/43-scheduling-and-worker-scalability.t .. 3/? # Looks like you failed 1 test of 4.


[08:57:06] t/43-scheduling-and-worker-scalability.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests 
[08:57:39]

Test Summary Report
-------------------
t/43-scheduling-and-worker-scalability.t (Wstat: 256 Tests: 4 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
Files=2, Tests=9, 85.2037 wallclock secs ( 0.57 usr  0.04 sys + 177.56 cusr 16.58 csys = 194.75 CPU)
Result: FAIL
Retrying...
[08:57:39] t/05-scheduler-full.t ..................... 
[08:57:39] t/05-scheduler-full.t ..................... 1/? 
[08:57:39] t/05-scheduler-full.t ..................... 2/? 
[08:57:39] t/05-scheduler-full.t ..................... 3/? 
[08:57:39] t/05-scheduler-full.t ..................... 4/? 
[08:57:39] t/05-scheduler-full.t ..................... 5/? 

[08:57:39] t/05-scheduler-full.t ..................... ok    55655 ms ( 0.45 usr  0.03 sys + 119.12 cusr 12.33 csys = 131.93 CPU)
[08:58:35] t/43-scheduling-and-worker-scalability.t .. 
[08:58:35] t/43-scheduling-and-worker-scalability.t .. 1/? 
    #   Failed test 'all jobs done'
    #   at t/43-scheduling-and-worker-scalability.t line 196.
    #          got: '4'
    #     expected: '5'

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 197.
    #          got: '4'
    #     expected: '5'
    # Looks like you failed 2 tests of 3.

[08:58:35] t/43-scheduling-and-worker-scalability.t .. 2/? 

that is failing after 3 retries with always the same failure.

Acceptance criteria

  • AC1: t/43-scheduling-and-worker-scalability.t is stable, e.g. 100/100 runs on circleCI as well as local

Suggestions

  • Run make test TESTS=t/43-scheduling-and-worker-scalability.t SCALABILITY_TEST=1 RETRY=100

History

#1 Updated by okurz 10 months ago

  • Status changed from In Progress to Feedback

#2 Updated by okurz 10 months ago

https://github.com/os-autoinst/openQA/pull/2882 is the first part from Martchus bumping the internal timeout waiting for jobs to finish. https://github.com/os-autoinst/openQA/pull/2885#issuecomment-610291020 describes a new, unexpected problem in the scheduler "full" test.

#3 Updated by cdywan 8 months ago

  • Description updated (diff)

#4 Updated by okurz 7 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from High to Low

I have not seen the test module t/43-scheduling-and-worker-scalability.t failing on the top level since we have a RETRY=3 and also my recent changes to the test setup, e.g. using dynamic-but-ensured-to-be-free ports. To complete this ticket one would still need to ensure stability without RETRY=3

#5 Updated by okurz 6 months ago

  • Target version set to Ready

Also available in: Atom PDF