action #179816

closed

[osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S

Added by emiler 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-04-02
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Jobs not scheduled for 4 days (345600s). Possible reasons:

  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value.
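As a quick sanity check of the threshold quoted in the alert text, here is a minimal Python sketch that converts 4 days to seconds and compares a job's scheduled age against it. The helper name `exceeds_alert_threshold` and the strict `>` comparison are assumptions for illustration; the actual alert rule is not shown in this ticket.

```python
from datetime import timedelta

# Threshold quoted in the alert text: 4 days, expressed in seconds.
ALERT_THRESHOLD_S = int(timedelta(days=4).total_seconds())


def exceeds_alert_threshold(scheduled_age_s: float) -> bool:
    """Hypothetical helper: True if a job has been scheduled for longer
    than the alert threshold. The real alert may compare differently."""
    return scheduled_age_s > ALERT_THRESHOLD_S


# A job that has been scheduled for 15 hours is far below the threshold.
fifteen_hours_s = timedelta(hours=15).total_seconds()
print(ALERT_THRESHOLD_S)
print(exceeds_alert_threshold(fifteen_hours_s))
```

A job queued for hours rather than days would therefore not trigger this alert on its own.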

Acceptance Criteria

  • AC1: Reason known if any jobs are queued up for multiple days

Suggestions

  • Investigate queued up jobs in All Tests
  • File follow-up tickets

Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure (public) - action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S · Resolved · dheidler · 2025-05-27

Actions #1

Updated by emiler 2 months ago

  • Tags set to infra, reactive work, alert
  • Subject changed from [osd][alert] Job age (scheduled) (max) alert size:s to [osd][alert] Job age (scheduled) (max) alert
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by okurz about 2 months ago

  • Priority changed from Low to Urgent
Actions #3

Updated by livdywan about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Taking a look now

Actions #4

Updated by livdywan about 2 months ago · Edited

Looking at the journal on osd I'm mostly observing that workers are free and there are a lot of investigate jobs for Windows 11:

$ sudo journalctl -S '2025-03-30 14:50:00' -U '2025-03-30 15:00:00'
[...]
Mar 30 14:50:27 openqa openqa-websockets-daemon[30438]: [debug] [pid:30438] Updating seen of worker 2522 from worker_status (free)
[...]
Mar 30 14:50:36 openqa openqa-webui-daemon[19886]:   "/var/lib/openqa/testresults/17201/17201668-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi",

Not sure that this explains the alert. The description points to #73174#note-2 which is not really telling me much here.

I guess I will ask the team for ideas.

Actions #5

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Resolved

I guess I will ask the team for ideas.

It was suggested that I look at All Tests, which doesn't go back that far. So we presumably have no other clues to investigate here.

Actions #6

Updated by okurz about 2 months ago

  • Status changed from Resolved to New
Actions #7

Updated by livdywan about 2 months ago

  • Status changed from New to In Progress

https://openqa.suse.de/tests/17264264

Result: incomplete, finished 2 minutes ago (ran for 00:15 minutes)
Reason: asset failure: Failed to download sle-16.0-aarch64-61.3-textmode@aarch64.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-16.0-aarch64-61.3-textmode@aarch64.qcow2

That's what the web UI says. At the same time the journal says this:

Apr 04 11:19:20 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3029 rejected job(s) 17264264: The average load (26.85 27.33 24.41) is exceeding the configured threshold of 14. The worker will temporarily not accept new jobs until the load is lower again.

Should a job be able to fail because an asset can't be found and also get rejected due to high load at the same time?

Actions #8

Updated by livdywan about 2 months ago · Edited

okurz wrote in #note-6:

https://suse.slack.com/archives/C02CANHLANP/p1743745104062149

I'm guessing this is about https://openqa.suse.de/tests/17259986

Scheduled for 15 hours as of now.

In that same timeframe I can see multiple workers rejecting jobs because they're overloaded:

Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3432 rejected job(s) 17257062: The average load (26.80 26.22 24.22) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257062 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2717 for job(s) 17257048                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2717 rejected job(s) 17257048: The average load (26.19 26.23 25.07) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257048 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2555 for job(s) 17256971                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2555 rejected job(s) 17256971: The average load (35.20 23.37 23.85) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17256971 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3342 for job(s) 17257055                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-webui-daemon[487]: [debug] redirect to /assets/hdd/sle-micro-6.2-aarch64-7.1-Default-qcow@aarch64-virtio-with-ltp-uefi-vars.qcow2                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3342 rejected job(s) 17257055: The average load (26.69 26.19 24.20) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257055 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2477 for job(s) 17257065                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2724 for job(s) 17257050                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2724 rejected job(s) 17257050: The average load (27.19 26.41 25.11) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257050 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2564 for job(s) 17257141                                                                                                                                                                               
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2564 rejected job(s) 17257141: The average load (34.27 23.74 23.96) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257141 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2477 accepted job 17257065                                                                                                                                                                                                 
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2532 for job(s) 17257099                                                                                                                                                                               
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2532 rejected job(s) 17257099: The average load (33.82 23.28 23.82) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257099 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2618 for job(s) 17258120                                                                                                                                                                               
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2618 rejected job(s) 17258120: The average load (27.26 26.23 24.17) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17258120 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2619 for job(s) 17258131                                                                                                                                                                               
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3073 for job(s) 17256775                                                                                                                                                                               
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2619 rejected job(s) 17258131: The average load (23.99 25.20 23.51) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17258131 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3073 rejected job(s) 17256775: The average load (41.51 33.13 30.88) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17256775 reset to state scheduled                                                                                                                                                                                             
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2726 for job(s) 17257054                                                                                                                                                                               
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2726 rejected job(s) 17257054: The average load (26.86 26.35 25.10) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257054 reset to state scheduled

https://openqa.suse.de/tests/17257048
https://openqa.suse.de/tests/17256971
https://openqa.suse.de/tests/17257055
https://openqa.suse.de/tests/17257050
https://openqa.suse.de/tests/17257141
https://openqa.suse.de/tests/17257099
https://openqa.suse.de/tests/17258120
https://openqa.suse.de/tests/17258131
https://openqa.suse.de/tests/17256775
https://openqa.suse.de/tests/17257054

All of the rejected jobs eventually ran. Nothing unexpected here.
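To make journal excerpts like the one above easier to review, here is a minimal Python sketch that extracts the worker ID, job ID(s), load averages and threshold from a rejection message. The regular expression is derived only from the log lines quoted in this ticket, the names `Rejection`, `parse_rejection` and `averages_over_threshold` are hypothetical, and the assumption that multiple job IDs would be comma-separated is untested. The sketch only reports which averages are above the threshold; it does not claim to reproduce the worker's actual decision logic.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Rejection(NamedTuple):
    worker_id: int
    job_ids: List[int]
    load: Tuple[float, ...]  # load averages in the order printed in the log
    threshold: float


# Pattern derived from the log lines quoted in this ticket (an assumption,
# not an official format specification).
REJECTION_RE = re.compile(
    r"Worker (?P<worker>\d+) rejected job\(s\) (?P<jobs>[\d, ]+): "
    r"The average load \((?P<load>[\d. ]+)\) is exceeding "
    r"the configured threshold of (?P<threshold>\d+(?:\.\d+)?)"
)


def parse_rejection(line: str) -> Optional[Rejection]:
    """Return the parsed rejection, or None if the line is not a rejection."""
    m = REJECTION_RE.search(line)
    if m is None:
        return None
    return Rejection(
        worker_id=int(m.group("worker")),
        job_ids=[int(j) for j in re.findall(r"\d+", m.group("jobs"))],
        load=tuple(float(v) for v in m.group("load").split()),
        threshold=float(m.group("threshold")),
    )


def averages_over_threshold(r: Rejection) -> List[bool]:
    """For each load average, report whether it is above the threshold."""
    return [v > r.threshold for v in r.load]


SAMPLE = (
    "Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] "
    "Worker 2619 rejected job(s) 17258131: The average load (23.99 25.20 23.51) "
    "is exceeding the configured threshold of 25. The worker will temporarily "
    "not accept new jobs until the load is lower again."
)

rejection = parse_rejection(SAMPLE)
print(rejection)
print(averages_over_threshold(rejection))
```

For the sample line from worker 2619, only the second of the three averages is above 25, so the rejection does not depend solely on the first value.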

Actions #9

Updated by livdywan about 2 months ago · Edited

Checking recent rejected jobs for sle-micro-6.2:

Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3078 rejected job(s) 17265694: The average load (37.02 24.66 23.31) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3078 for job(s) 17265694                                                                                                                                                                               
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17264859 reset to state scheduled                                                                                                                                                                                             
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3054 rejected job(s) 17264859: The average load (30.94 22.47 22.58) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.                            
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3054 for job(s) 17264859                                                                                                                                                                               
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Updating seen of worker 2739 from worker_status (free)                                                                                                                                                                            
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17265585 reset to state scheduled                                                                                                                                                                                             
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3059 rejected job(s) 17265585: The average load (28.32 23.48 22.95) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again. 

https://openqa.suse.de/tests/17265694
https://openqa.suse.de/tests/17264859
https://openqa.suse.de/tests/17265585

Those were all picked up shortly after being rejected.

If something was wrong with that particular scenario or worker, it was resolved just now. Older jobs I was looking at before including https://openqa.suse.de/tests/17259986 are also running now.

Most likely this was due to high load on arm1/arm2. I was able to find multiple corresponding messages in the journal, e.g. using journalctl -r -g 17265694:

rejected job(s) 17265694: The average load (37.02 24.66 23.31) is exceeding the configured threshold of 16.
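The journal searches used in this ticket can be scripted. The following Python sketch builds the equivalent command line using the long forms of the options seen above (`--reverse` for `-r`, `--grep` for `-g`, `--since` and `--until` for `-S` and `-U`). The function names `journalctl_cmd` and `run_journal_search` are hypothetical, and running the search needs a host with journal access and sufficient privileges.

```python
import shlex
import subprocess
from typing import List, Optional


def journalctl_cmd(
    pattern: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
    newest_first: bool = True,
) -> List[str]:
    """Build a journalctl argument list that greps for a job ID or message."""
    cmd = ["journalctl", "--no-pager"]
    if newest_first:
        cmd.append("--reverse")
    if since:
        cmd += ["--since", since]
    if until:
        cmd += ["--until", until]
    cmd += ["--grep", pattern]
    return cmd


def run_journal_search(pattern: str, **kwargs) -> str:
    """Run the search and return its output (not called in this sketch)."""
    result = subprocess.run(
        journalctl_cmd(pattern, **kwargs),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Only print the command here; executing it requires access to the journal.
print(shlex.join(journalctl_cmd("17265694")))
print(shlex.join(journalctl_cmd(
    "rejected job", since="2025-04-03 20:30:00", until="2025-04-03 20:40:00",
)))
```

Passing the command as a list avoids shell quoting problems with timestamps that contain spaces.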
Actions #10

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

I think there's nothing to be done right now, in the sense that the load limit works as intended.

That said, I'm wondering if we can adjust the alert to stop it from flagging this case as high job age. We want these jobs to be queued.

Also, see #180050 for making this situation more discoverable.

Actions #11

Updated by livdywan about 2 months ago

  • Subject changed from [osd][alert] Job age (scheduled) (max) alert to [osd][alert] Job age (scheduled) (max) alert due to overloaded worker
Actions #12

Updated by okurz about 2 months ago

Yeah but the alert only triggers if jobs are queued for 4 days (!)

Actions #13

Updated by livdywan about 2 months ago

okurz wrote in #note-12:

Yeah but the alert only triggers if jobs are queued for 4 days (!)

The jobs I was investigating weren't that old. I don't know what the original jobs were.

Actions #14

Updated by livdywan about 2 months ago

  • Subject changed from [osd][alert] Job age (scheduled) (max) alert due to overloaded worker to [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S
  • Description updated (diff)
Actions #15

Updated by livdywan about 2 months ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

So we discussed it in detail during the estimation. I worked on this ticket in terms of #note-6 and probably should have rejected this ticket and opened a new one for clarity. We have no way of knowing what the original ticket was about.

Actions #16

Updated by gpathak 28 days ago

  • Copied to action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S added