action #165198
[sporadic] ci: t/05-scheduler-full.t and t/43-scheduling-and-worker-scalability.t failing
Status: closed
Description
Observation
In https://github.com/os-autoinst/openQA/pull/5843 and https://github.com/os-autoinst/openQA/pull/5848 the unstable and fullstack-unstable test jobs are failing.
https://app.circleci.com/pipelines/github/perlpunk/openQA/357/workflows/d8ac13a9-ce17-413a-bbc2-8b9779b9e1a2/jobs/2482
[13:20:55] t/05-scheduler-full.t .. 2/?
# Failed test 'one job allocated'
# at t/05-scheduler-full.t line 165.
# got: '0'
# expected: '1'
# Failed test 'one job allocated'
# at t/05-scheduler-full.t line 195.
# got: '0'
# expected: '1'
# Looks like you failed 2 tests of 8.
[13:20:55] t/05-scheduler-full.t .. 3/?
# Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
# at t/05-scheduler-full.t line 213.
[WARN] Failed sending job(s) '99927' to worker '6': Unable to assign job to worker 6: the worker is not connected anymore
[WARN] Failed sending job(s) '99983' to worker '34': Unable to assign job to worker 34: the worker is not connected anymore
[WARN] Failed sending job(s) '99988' to worker '39': Unable to assign job to worker 39: the worker is not connected anymore
[WARN] Failed sending job(s) '99987' to worker '8': Unable to assign job to worker 8: the worker is not connected anymore
[WARN] Failed sending job(s) '99928' to worker '12': Unable to assign job to worker 12: the worker is not connected anymore
[WARN] Failed sending job(s) '99989' to worker '22': Unable to assign job to worker 22: the worker is not connected anymore
[WARN] Failed sending job(s) '99990' to worker '7': Unable to assign job to worker 7: the worker is not connected anymore
[WARN] Failed sending job(s) '99984' to worker '11': Unable to assign job to worker 11: the worker is not connected anymore
[WARN] Failed sending job(s) '99985' to worker '3': Unable to assign job to worker 3: the worker is not connected anymore
[WARN] Failed sending job(s) '99991' to worker '33': Unable to assign job to worker 33: the worker is not connected anymore
[13:20:55] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 6.
[13:20:55] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/6 subtests
[13:23:16]
[11:03:20] t/43-scheduling-and-worker-scalability.t ..
# Failed test 'all workers idling'
# at t/43-scheduling-and-worker-scalability.t line 146.
[11:03:20] t/43-scheduling-and-worker-scalability.t .. 1/? # 0
# Looks like you failed 1 test of 2.
# Failed test 'wait for workers to be idle'
# at t/43-scheduling-and-worker-scalability.t line 147.
# 0
# 0
Bailout called. Further testing stopped: Unable to assign jobs to (idling) workers
Updated by okurz about 2 months ago
- Tags set to reactive work
- Priority changed from High to Urgent
I would like us to treat this with urgency: so far it looks like a recent regression, so we should identify the cause.
Updated by livdywan about 2 months ago
- Status changed from New to In Progress
- Assignee set to livdywan
So the first three questions I think we need to answer are:
- What introduced this regression?
- How reproducible is it? We have seen it in two PRs so far.
- Does this reproduce on o3 or osd?
sudo journalctl -g 'Unable to assign job to worker'
might tell us that.
Updated by tinita about 2 months ago
- File 05-scheduler-full.t added
Attached the full log output from the artifacts https://app.circleci.com/pipelines/github/perlpunk/openQA/357/workflows/206da54f-1691-48b6-91d1-f9a4c30164d3/jobs/2484/artifacts
Updated by livdywan about 2 months ago
Trying to (re)run these PRs at the moment to confirm whether this is actually in openQA, or maybe in CircleCI/codecov.
Updated by tinita about 2 months ago
- Priority changed from Urgent to High
This seems to happen when the machines on CircleCI are under heavy load: workers don't accept new jobs when the load is too high.
See https://progress.opensuse.org/attachments/18222#L33
[error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.
https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load
Lowering prio
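For context on the threshold mentioned above: a worker compares the current system load average against a configured limit and refuses new jobs while the limit is exceeded. A minimal shell sketch of that kind of check, assuming the 15-minute field of /proc/loadavg is used (the function name check_load and the message wording are illustrative, not openQA's actual implementation):

```shell
# Illustrative only, not openQA's actual code. Prints a message and returns
# 0 if the load is below the given threshold, 1 if it exceeds it.
check_load() {
    threshold=$1
    # /proc/loadavg fields: 1-min, 5-min and 15-min load averages,
    # running/total tasks, last PID; take the 15-min value here
    load=$(cut -d' ' -f3 /proc/loadavg)
    # shell arithmetic is integer-only, so compare the floats with awk
    if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
        echo "The average load $load is exceeding the configured threshold of $threshold."
        return 1
    fi
    echo "Average load $load is below the configured threshold of $threshold."
}
```

With a threshold of 40 as in the CI log above, a loaded CircleCI machine reporting 49.71 would hit the first branch and stop picking up jobs until the load drops.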
Updated by livdywan about 2 months ago
- Status changed from In Progress to Feedback
tinita wrote in #note-5:
This seems to happen when the machines on CircleCI are under heavy load: workers don't accept new jobs when the load is too high.
See https://progress.opensuse.org/attachments/18222#L33
[error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.
https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load
Lowering prio
Good catch! I hadn't seen the message.
Updated by tinita about 2 months ago
- Status changed from Feedback to In Progress
https://github.com/os-autoinst/openQA/pull/5853 Move simulating load to a function
Updated by openqa_review about 2 months ago
- Due date set to 2024-08-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 months ago
- Due date deleted (2024-08-28)
- Status changed from In Progress to Resolved
https://github.com/os-autoinst/openQA/pull/5853 merged. That should suffice for this ticket. Thanks tinita and livdywan for your work.