action #165198

closed

[sporadic] ci: t/05-scheduler-full.t and t/43-scheduling-and-worker-scalability.t failing

Added by tinita about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-08-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

In https://github.com/os-autoinst/openQA/pull/5843 and https://github.com/os-autoinst/openQA/pull/5848 we see the unstable and fullstack-unstable test(s) failing.
https://app.circleci.com/pipelines/github/perlpunk/openQA/357/workflows/d8ac13a9-ce17-413a-bbc2-8b9779b9e1a2/jobs/2482

[13:20:55] t/05-scheduler-full.t .. 2/? 
    #   Failed test 'one job allocated'
    #   at t/05-scheduler-full.t line 165.
    #          got: '0'
    #     expected: '1'

    #   Failed test 'one job allocated'
    #   at t/05-scheduler-full.t line 195.
    #          got: '0'
    #     expected: '1'
    # Looks like you failed 2 tests of 8.
[13:20:55] t/05-scheduler-full.t .. 3/? 
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 213.
[WARN] Failed sending job(s) '99927' to worker '6': Unable to assign job to worker 6: the worker is not connected anymore
[WARN] Failed sending job(s) '99983' to worker '34': Unable to assign job to worker 34: the worker is not connected anymore
[WARN] Failed sending job(s) '99988' to worker '39': Unable to assign job to worker 39: the worker is not connected anymore
[WARN] Failed sending job(s) '99987' to worker '8': Unable to assign job to worker 8: the worker is not connected anymore
[WARN] Failed sending job(s) '99928' to worker '12': Unable to assign job to worker 12: the worker is not connected anymore
[WARN] Failed sending job(s) '99989' to worker '22': Unable to assign job to worker 22: the worker is not connected anymore
[WARN] Failed sending job(s) '99990' to worker '7': Unable to assign job to worker 7: the worker is not connected anymore
[WARN] Failed sending job(s) '99984' to worker '11': Unable to assign job to worker 11: the worker is not connected anymore
[WARN] Failed sending job(s) '99985' to worker '3': Unable to assign job to worker 3: the worker is not connected anymore
[WARN] Failed sending job(s) '99991' to worker '33': Unable to assign job to worker 33: the worker is not connected anymore
[13:20:55] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 6.
[13:20:55] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/6 subtests 
[13:23:16]

https://app.circleci.com/pipelines/github/os-autoinst/openQA/14280/workflows/53f03097-0191-4093-a077-4761e29180a4/jobs/134859

[11:03:20] t/43-scheduling-and-worker-scalability.t .. 
    #   Failed test 'all workers idling'
    #   at t/43-scheduling-and-worker-scalability.t line 146.
[11:03:20] t/43-scheduling-and-worker-scalability.t .. 1/?     # 0
    # Looks like you failed 1 test of 2.

#   Failed test 'wait for workers to be idle'
#   at t/43-scheduling-and-worker-scalability.t line 147.
    # 0
    # 0
Bailout called.  Further testing stopped:  Unable to assign jobs to (idling) workers

Files

05-scheduler-full.t (164 KB) tinita, 2024-08-13 15:15

Actions #1

Updated by okurz about 2 months ago

  • Tags set to reactive work
  • Priority changed from High to Urgent

I would like us to treat this as urgent: so far it looks like a recent regression, so we should identify the cause.

Actions #2

Updated by livdywan about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

So the first three questions I think we need to answer:

  • What introduced this regression?
  • How reproducible is it? We've seen it in 2 PRs so far.
  • Does this reproduce on o3 or osd? Running sudo journalctl -g 'Unable to assign job to worker' there might tell us.
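
For the third question, a small script could tally which workers show up in that message. This is only a sketch: the function name count_assignment_failures is made up for illustration, and the pattern is derived from the [WARN] lines quoted in the description, not from openQA's code.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern based on the "[WARN] Failed sending job(s) ..." lines in the description.
PATTERN = re.compile(r"Unable to assign job to worker (\d+)")


def count_assignment_failures(lines: Iterable[str]) -> Counter:
    """Count failed job assignments per worker ID (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two warning lines and one unrelated line copied from the CI log above.
sample = [
    "[WARN] Failed sending job(s) '99927' to worker '6': "
    "Unable to assign job to worker 6: the worker is not connected anymore",
    "[WARN] Failed sending job(s) '99983' to worker '34': "
    "Unable to assign job to worker 34: the worker is not connected anymore",
    "[13:20:55] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 6.",
]
failures = count_assignment_failures(sample)
print(sum(failures.values()), sorted(failures))
```

On a real host the same function could read sys.stdin, with the journalctl output piped into it.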

Actions #4

Updated by livdywan about 2 months ago

Trying to (re)run these PRs at the moment to confirm whether this is actually in openQA, or maybe even in CircleCI/codecov:

Actions #5

Updated by tinita about 2 months ago

  • Priority changed from Urgent to High

This seems to happen if the machines on CircleCI are under a heavy load.
Workers don't accept new tests when the load is too high.
See https://progress.opensuse.org/attachments/18222#L33

[error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.
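
The behaviour behind that log line can be pictured as a simple threshold check. The following is a minimal sketch, not openQA's actual worker code (which is Perl): the function name check_load is hypothetical, and using the 1-minute load average is an assumption, since the log message does not say which average is compared.

```python
import os
from typing import Optional, Tuple


def check_load(threshold: float, load: Optional[float] = None) -> Tuple[bool, str]:
    """Return (accepting_jobs, message) for a load threshold (hypothetical helper).

    If `load` is None, the 1-minute system load average is used. Which
    average the real worker compares is an assumption here.
    """
    if load is None:
        load = os.getloadavg()[0]  # 1-minute average, Unix only
    if load > threshold:
        message = (
            f"The average load {load:.2f} is exceeding "
            f"the configured threshold of {threshold:g}."
        )
        return False, message
    return True, ""


# Values taken from the CI log line quoted above.
accepting, message = check_load(threshold=40, load=49.71)
print(accepting, message)
```

With a threshold of 40 and a load of 49.71 the sketch reports that no new jobs are accepted, which matches the 'one job allocated' checks getting 0 instead of 1.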

https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load

Lowering prio

Actions #6

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback

tinita wrote in #note-5:

> This seems to happen if the machines on CircleCI are under a heavy load.
> Workers don't accept new tests when the load is too high.
> See https://progress.opensuse.org/attachments/18222#L33
>
> [error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.
>
> https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load
>
> Lowering prio

Good catch! I hadn't seen that message.

Actions #7

Updated by tinita about 2 months ago

  • Status changed from Feedback to In Progress

https://github.com/os-autoinst/openQA/pull/5853 Move simulating load to a function

Actions #8

Updated by openqa_review about 2 months ago

  • Due date set to 2024-08-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by okurz about 2 months ago

  • Due date deleted (2024-08-28)
  • Status changed from In Progress to Resolved

https://github.com/os-autoinst/openQA/pull/5853 merged. That should suffice for this ticket. Thanks tinita and livdywan for your work.
