action #165198: [sporadic] ci: t/05-scheduler-full.t and t/43-scheduling-and-worker-scalability.t failing - openQA Project (public) - openSUSE Project Management Tool

action #165198

closed

[sporadic] ci: t/05-scheduler-full.t and t/43-scheduling-and-worker-scalability.t failing

Added by tinita 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee: livdywan
Category: Regressions/Crashes
Target version: Ready
Start date: 2024-08-13
Due date:
% Done:
Estimated time:
Tags: reactive work

Description

Observation

In https://github.com/os-autoinst/openQA/pull/5843 and https://github.com/os-autoinst/openQA/pull/5848 we see the unstable and fullstack-unstable test(s) failing.
https://app.circleci.com/pipelines/github/perlpunk/openQA/357/workflows/d8ac13a9-ce17-413a-bbc2-8b9779b9e1a2/jobs/2482

[13:20:55] t/05-scheduler-full.t .. 2/? 
    #   Failed test 'one job allocated'
    #   at t/05-scheduler-full.t line 165.
    #          got: '0'
    #     expected: '1'

    #   Failed test 'one job allocated'
    #   at t/05-scheduler-full.t line 195.
    #          got: '0'
    #     expected: '1'
    # Looks like you failed 2 tests of 8.
[13:20:55] t/05-scheduler-full.t .. 3/? 
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 213.
[WARN] Failed sending job(s) '99927' to worker '6': Unable to assign job to worker 6: the worker is not connected anymore
[WARN] Failed sending job(s) '99983' to worker '34': Unable to assign job to worker 34: the worker is not connected anymore
[WARN] Failed sending job(s) '99988' to worker '39': Unable to assign job to worker 39: the worker is not connected anymore
[WARN] Failed sending job(s) '99987' to worker '8': Unable to assign job to worker 8: the worker is not connected anymore
[WARN] Failed sending job(s) '99928' to worker '12': Unable to assign job to worker 12: the worker is not connected anymore
[WARN] Failed sending job(s) '99989' to worker '22': Unable to assign job to worker 22: the worker is not connected anymore
[WARN] Failed sending job(s) '99990' to worker '7': Unable to assign job to worker 7: the worker is not connected anymore
[WARN] Failed sending job(s) '99984' to worker '11': Unable to assign job to worker 11: the worker is not connected anymore
[WARN] Failed sending job(s) '99985' to worker '3': Unable to assign job to worker 3: the worker is not connected anymore
[WARN] Failed sending job(s) '99991' to worker '33': Unable to assign job to worker 33: the worker is not connected anymore
[13:20:55] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 6.
[13:20:55] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/6 subtests 
[13:23:16]

https://app.circleci.com/pipelines/github/os-autoinst/openQA/14280/workflows/53f03097-0191-4093-a077-4761e29180a4/jobs/134859

[11:03:20] t/43-scheduling-and-worker-scalability.t .. 
    #   Failed test 'all workers idling'
    #   at t/43-scheduling-and-worker-scalability.t line 146.
[11:03:20] t/43-scheduling-and-worker-scalability.t .. 1/?     # 0
    # Looks like you failed 1 test of 2.

#   Failed test 'wait for workers to be idle'
#   at t/43-scheduling-and-worker-scalability.t line 147.
    # 0
    # 0
Bailout called.  Further testing stopped:  Unable to assign jobs to (idling) workers

Files

05-scheduler-full.t (164 KB), added by tinita, 2024-08-13 15:15

Updated by okurz 7 months ago

Tags set to reactive work
Priority changed from High to Urgent

I would like us to treat this with urgency. So far it looks like a recent regression, so we should identify the cause.

Updated by livdywan 7 months ago

Status changed from New to In Progress
Assignee set to livdywan

The first three questions I think we need to answer:

What introduced this regression?
How reproducible is it? We've seen it in 2 PRs so far.
Does this reproduce on o3 or osd? Running sudo journalctl -g 'Unable to assign job to worker' there might tell us.
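
To answer the third question from collected log output, a small script could count the failed assignments per worker. The following is only a sketch: it assumes log lines in the same format as the [WARN] messages quoted in the description, and the names ASSIGN_FAIL and failed_assignments are made up for illustration, not part of openQA.

```python
import re
from collections import Counter

# Pattern modelled on the [WARN] lines quoted in the ticket description.
# Assumption: real journal entries use the same message format.
ASSIGN_FAIL = re.compile(
    r"Failed sending job\(s\) '(?P<jobs>[^']+)' to worker '(?P<worker>\d+)'"
)


def failed_assignments(lines):
    """Count failed job assignments per worker ID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ASSIGN_FAIL.search(line)
        if match:
            counts[int(match["worker"])] += 1
    return counts


# Two warnings copied from the ticket plus one unrelated test output line.
sample = [
    "[WARN] Failed sending job(s) '99927' to worker '6': Unable to assign job "
    "to worker 6: the worker is not connected anymore",
    "[WARN] Failed sending job(s) '99983' to worker '34': Unable to assign job "
    "to worker 34: the worker is not connected anymore",
    "[13:20:55] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 6.",
]

counts = failed_assignments(sample)
print(sorted(counts.items()))
```

On a real host the lines could come from the journalctl command above, for example by piping its output into the script and passing sys.stdin to failed_assignments.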

Updated by tinita 7 months ago

File 05-scheduler-full.t added

Attached the full log output from the artifacts https://app.circleci.com/pipelines/github/perlpunk/openQA/357/workflows/206da54f-1691-48b6-91d1-f9a4c30164d3/jobs/2484/artifacts

Updated by livdywan 7 months ago

Trying to (re)run these PRs at the moment to confirm whether this is actually in openQA, or maybe even in CircleCI/codecov.

Updated by tinita 7 months ago

Priority changed from Urgent to High

This seems to happen if the machines on CircleCI are under a heavy load.
Workers don't accept new tests when the load is too high.
See https://progress.opensuse.org/attachments/18222#L33

[error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.

https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load

Lowering prio
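
The behaviour described above can be illustrated with a minimal sketch. This is not openQA's actual implementation (openQA is written in Perl); the function names accepts_jobs and parse_load_error are hypothetical, and the only facts taken from the ticket are the error message format and the values 49.71 and 40.

```python
import re

# Pattern modelled on the [error] line quoted above.
LOAD_MSG = re.compile(
    r"The average load (?P<load>\d+(?:\.\d+)?) is exceeding "
    r"the configured threshold of (?P<threshold>\d+(?:\.\d+)?)"
)


def parse_load_error(line):
    """Return (load, threshold) from a load error line, or None if absent."""
    match = LOAD_MSG.search(line)
    if match is None:
        return None
    return float(match["load"]), float(match["threshold"])


def accepts_jobs(load_avg, threshold):
    """Hypothetical rule: a worker takes new jobs only at or below the threshold."""
    return load_avg <= threshold


line = (
    "[error] [pid:1531] The average load 49.71 is exceeding "
    "the configured threshold of 40."
)
load, threshold = parse_load_error(line)
print(accepts_jobs(load, threshold))
```

With the values from the log, the sketched worker refuses new jobs, which matches the 'one job allocated' failures: the scheduler has nothing to assign to while the CI machine is overloaded.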

Updated by livdywan 7 months ago

Status changed from In Progress to Feedback

tinita wrote in #note-5:

This seems to happen if the machines on CircleCI are under a heavy load.
Workers don't accept new tests when the load is too high.
See https://progress.opensuse.org/attachments/18222#L33
[error] [pid:1531] The average load 49.71 is exceeding the configured threshold of 40.
https://github.com/os-autoinst/openQA/pull/5852 ci: Ensure tests pass even under high load

Lowering prio

Good catch! I hadn't seen that message.

Updated by tinita 7 months ago

Status changed from Feedback to In Progress

https://github.com/os-autoinst/openQA/pull/5853 Move simulating load to a function

Updated by openqa_review 7 months ago

Due date set to 2024-08-28

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz 7 months ago

Due date deleted (2024-08-28)
Status changed from In Progress to Resolved

https://github.com/os-autoinst/openQA/pull/5853 merged. That should suffice for this ticket. Thanks tinita and livdywan for your work.
