action #179816
closed[osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S
0%
Description
Motivation
Jobs not scheduled for 4 days (345600s). Possible reasons:
- There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
- https://mailman.suse.de/mlarch/SuSE/osd-admins/2025/osd-admins.2025.03/msg00338.html
- https://monitor.qa.suse.de/alerting/grafana/XzAh5mfVz/view?orgId=1
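As a quick sanity check of the numbers above, the following sketch converts the 4-day threshold to seconds and tests whether a job's waiting time would cross it. The function name and the example timestamps are hypothetical and only serve as illustration; the real alert is evaluated in Grafana, not in Python.

```python
from datetime import datetime, timedelta, timezone

# Threshold quoted in the alert description: 4 days.
MAX_SCHEDULED_AGE = timedelta(days=4)

def exceeds_alert_threshold(scheduled_at: datetime, now: datetime) -> bool:
    """Return True if a job has been waiting longer than the alert threshold."""
    return now - scheduled_at > MAX_SCHEDULED_AGE

# Hypothetical reference time, chosen only for the example.
now = datetime(2025, 4, 4, 12, 0, tzinfo=timezone.utc)

print(int(MAX_SCHEDULED_AGE.total_seconds()))                    # 345600
print(exceeds_alert_threshold(now - timedelta(hours=15), now))   # False
print(exceeds_alert_threshold(now - timedelta(days=5), now))     # True
```

A job that has been scheduled for 15 hours is far below the threshold, so it would not trigger this alert on its own.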
Acceptance Criteria
- AC1: Reason known if any jobs are queued up for multiple days
Suggestions
- Investigate queued up jobs in All Tests
- File follow-up tickets
Updated by livdywan about 2 months ago
- Status changed from New to In Progress
- Assignee set to livdywan
Taking a look now
Updated by livdywan about 2 months ago · Edited
Looking at the journal on osd, I'm mostly observing that workers are free and that there are a lot of investigate jobs for Windows 11:
$ sudo journalctl -S '2025-03-30 14:50:00' -U '2025-03-30 15:00:00'
[...]
Mar 30 14:50:27 openqa openqa-websockets-daemon[30438]: [debug] [pid:30438] Updating seen of worker 2522 from worker_status (free)
[...]
Mar 30 14:50:36 openqa openqa-webui-daemon[19886]: "/var/lib/openqa/testresults/17201/17201668-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi",
Not sure that this explains the alert. The description points to #73174#note-2 which is not really telling me much here.
I guess I will ask the team for ideas.
Updated by livdywan about 2 months ago
- Status changed from In Progress to Resolved
I guess I will ask the team for ideas.
It was suggested that I look at All Tests, which doesn't go back that far. So we presumably have no other clues to investigate here.
Updated by okurz about 2 months ago
- Status changed from Resolved to New
Updated by livdywan about 2 months ago
- Status changed from New to In Progress
https://openqa.suse.de/tests/17264264
Result: incomplete, finished 2 minutes ago (ran for 00:15 minutes)
Reason: asset failure: Failed to download sle-16.0-aarch64-61.3-textmode@aarch64.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-16.0-aarch64-61.3-textmode@aarch64.qcow2
That's what the web UI says. At the same time the journal says this:
Apr 04 11:19:20 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3029 rejected job(s) 17264264: The average load (26.85 27.33 24.41) is exceeding the configured threshold of 14. The worker will temporarily not accept new jobs until the load is lower again.
Should a job be able to fail because an asset can't be found and also get rejected due to high load at the same time?
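To make the rejection message easier to reason about, here is a minimal sketch of a load check that is consistent with the journal lines quoted in this ticket. It assumes the worker rejects jobs whenever any of the three load averages is above the configured threshold; the actual openQA worker logic may be more nuanced, and `should_reject` is a hypothetical name.

```python
def should_reject(load, threshold):
    """Return True if a worker with this load would refuse new jobs.

    Assumption (not confirmed by the ticket): a job is rejected as soon as
    any of the 1, 5 or 15 minute load averages exceeds the threshold.
    """
    return any(value > threshold for value in load)

# Load values and thresholds taken from journal lines in this ticket.
print(should_reject((26.85, 27.33, 24.41), 14))  # True
print(should_reject((23.99, 25.20, 23.51), 25))  # True
# Hypothetical idle worker for comparison.
print(should_reject((10.0, 12.0, 13.5), 14))     # False
```

On a Unix worker host, the tuple could come from `os.getloadavg()`, which returns the same three averages.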
Updated by livdywan about 2 months ago · Edited
okurz wrote in #note-6:
https://suse.slack.com/archives/C02CANHLANP/p1743745104062149
I'm guessing this is about https://openqa.suse.de/tests/17259986
Scheduled for 15 hours as of now.
In that same timeframe I can see multiple workers rejecting jobs because they're overloaded:
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3432 rejected job(s) 17257062: The average load (26.80 26.22 24.22) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257062 reset to state scheduled
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2717 for job(s) 17257048
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2717 rejected job(s) 17257048: The average load (26.19 26.23 25.07) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257048 reset to state scheduled
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2555 for job(s) 17256971
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2555 rejected job(s) 17256971: The average load (35.20 23.37 23.85) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17256971 reset to state scheduled
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3342 for job(s) 17257055
Apr 03 20:35:01 openqa openqa-webui-daemon[487]: [debug] redirect to /assets/hdd/sle-micro-6.2-aarch64-7.1-Default-qcow@aarch64-virtio-with-ltp-uefi-vars.qcow2
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3342 rejected job(s) 17257055: The average load (26.69 26.19 24.20) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257055 reset to state scheduled
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2477 for job(s) 17257065
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2724 for job(s) 17257050
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2724 rejected job(s) 17257050: The average load (27.19 26.41 25.11) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257050 reset to state scheduled
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2564 for job(s) 17257141
Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2564 rejected job(s) 17257141: The average load (34.27 23.74 23.96) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257141 reset to state scheduled
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2477 accepted job 17257065
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2532 for job(s) 17257099
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2532 rejected job(s) 17257099: The average load (33.82 23.28 23.82) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257099 reset to state scheduled
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2618 for job(s) 17258120
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2618 rejected job(s) 17258120: The average load (27.26 26.23 24.17) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17258120 reset to state scheduled
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2619 for job(s) 17258131
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3073 for job(s) 17256775
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2619 rejected job(s) 17258131: The average load (23.99 25.20 23.51) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17258131 reset to state scheduled
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3073 rejected job(s) 17256775: The average load (41.51 33.13 30.88) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17256775 reset to state scheduled
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 2726 for job(s) 17257054
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 2726 rejected job(s) 17257054: The average load (26.86 26.35 25.10) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257054 reset to state scheduled
https://openqa.suse.de/tests/17257048
https://openqa.suse.de/tests/17256971
https://openqa.suse.de/tests/17257055
https://openqa.suse.de/tests/17257050
https://openqa.suse.de/tests/17257141
https://openqa.suse.de/tests/17257099
https://openqa.suse.de/tests/17258120
https://openqa.suse.de/tests/17258131
https://openqa.suse.de/tests/17256775
https://openqa.suse.de/tests/17257054
All of the rejected jobs eventually ran. Nothing unexpected here.
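The list of job links above was collected from the journal by hand. A rough sketch of how this could be scripted follows: it extracts worker, job IDs, load and threshold from the rejection messages and builds the test URLs. The regular expression is written against the message format seen in this ticket only; the function names are hypothetical, and the handling of several job IDs in one message is an assumption.

```python
import re

# Pattern for the rejection message as it appears in the journal excerpts.
REJECTION = re.compile(
    r"Worker (?P<worker>\d+) rejected job\(s\) (?P<jobs>[^:]+): "
    r"The average load \((?P<load>[\d. ]+)\) is exceeding "
    r"the configured threshold of (?P<threshold>\d+)"
)

OPENQA_HOST = "https://openqa.suse.de"

def parse_rejections(lines):
    """Yield one dict per rejection message found in the given journal lines."""
    for line in lines:
        match = REJECTION.search(line)
        if match is None:
            continue
        yield {
            "worker": int(match["worker"]),
            # Assumption: multiple job IDs, if any, are separated by non-digits.
            "jobs": [int(job) for job in re.findall(r"\d+", match["jobs"])],
            "load": tuple(float(value) for value in match["load"].split()),
            "threshold": int(match["threshold"]),
        }

def job_urls(rejections):
    """Build the web UI link for every rejected job."""
    return [f"{OPENQA_HOST}/tests/{job}" for r in rejections for job in r["jobs"]]

# Three lines copied from the journal excerpt in this comment.
journal = [
    "Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3432 rejected job(s) 17257062: The average load (26.80 26.22 24.22) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.",
    "Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17257062 reset to state scheduled",
    "Apr 03 20:35:02 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3073 rejected job(s) 17256775: The average load (41.51 33.13 30.88) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.",
]

rejections = list(parse_rejections(journal))
for url in job_urls(rejections):
    print(url)
```

The input could be the output of `journalctl` saved to a file and read line by line.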
Updated by livdywan about 2 months ago · Edited
Checking recent rejected jobs for sle-micro-6.2:
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3078 rejected job(s) 17265694: The average load (37.02 24.66 23.31) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3078 for job(s) 17265694
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17264859 reset to state scheduled
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3054 rejected job(s) 17264859: The average load (30.94 22.47 22.58) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Started to send message to 3054 for job(s) 17264859
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Updating seen of worker 2739 from worker_status (free)
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Job 17265585 reset to state scheduled
Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3059 rejected job(s) 17265585: The average load (28.32 23.48 22.95) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
https://openqa.suse.de/tests/17265694
https://openqa.suse.de/tests/17264859
https://openqa.suse.de/tests/17265585
Those were all picked up shortly after being rejected.
If something was wrong with that particular scenario or worker, it was resolved just now. Older jobs I was looking at before including https://openqa.suse.de/tests/17259986 are also running now.
Most likely this was due to high load on arm1/arm2. I was able to find multiple corresponding messages in the journal, e.g. using journalctl -r -g 17265694:
rejected job(s) 17265694: The average load (37.02 24.66 23.31) is exceeding the configured threshold of 16.
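To see whether rejections cluster on a particular group of workers, the messages can be grouped by the configured threshold. This is only a sketch: the ticket does not map worker IDs to hosts such as arm1/arm2, so the threshold is used here as a rough proxy for the worker class, and `workers_by_threshold` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Same message format as in the journal excerpts of this ticket.
REJECTION = re.compile(
    r"Worker (\d+) rejected job\(s\) [^:]+: "
    r"The average load \(([\d. ]+)\) is exceeding "
    r"the configured threshold of (\d+)"
)

def workers_by_threshold(lines):
    """Map each configured threshold to the set of worker IDs that rejected jobs."""
    grouped = defaultdict(set)
    for line in lines:
        match = REJECTION.search(line)
        if match:
            worker, _load, threshold = match.groups()
            grouped[int(threshold)].add(int(worker))
    return dict(grouped)

# Lines copied from the journal excerpts in this ticket.
journal = [
    "Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3078 rejected job(s) 17265694: The average load (37.02 24.66 23.31) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.",
    "Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3054 rejected job(s) 17264859: The average load (30.94 22.47 22.58) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.",
    "Apr 04 11:58:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3059 rejected job(s) 17265585: The average load (28.32 23.48 22.95) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.",
    "Apr 03 20:35:01 openqa openqa-websockets-daemon[9066]: [debug] [pid:9066] Worker 3432 rejected job(s) 17257062: The average load (26.80 26.22 24.22) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.",
]

grouped = workers_by_threshold(journal)
for threshold in sorted(grouped):
    print(threshold, sorted(grouped[threshold]))
```

In this sample, all three rejections from Apr 04 come from workers with a threshold of 16.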
Updated by livdywan about 2 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
I think there's nothing to be done right now, in the sense that the load limit works as intended.
That said, I'm wondering if we can adjust the alert to stop it from flagging this case as high job age. We want these jobs to be queued.
Also, see #180050 for making this situation more discoverable.
Updated by livdywan about 2 months ago
- Subject changed from [osd][alert] Job age (scheduled) (max) alert to [osd][alert] Job age (scheduled) (max) alert due to overloaded worker
Updated by okurz about 2 months ago
Yeah but the alert only triggers if jobs are queued for 4 days (!)
Updated by livdywan about 2 months ago
okurz wrote in #note-12:
Yeah but the alert only triggers if jobs are queued for 4 days (!)
The jobs I was investigating weren't that old. I don't know what the original jobs were.
Updated by livdywan about 2 months ago
- Subject changed from [osd][alert] Job age (scheduled) (max) alert due to overloaded worker to [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S
- Description updated (diff)
Updated by livdywan about 2 months ago
- Description updated (diff)
- Status changed from Feedback to Resolved
So we discussed it in detail during the estimation. I worked on this ticket in terms of #note-6 and probably should have rejected this ticket and opened a new one for clarity. We have no way of knowing what the original ticket was about.
Updated by gpathak 28 days ago
- Copied to action #181766: [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S added