action #181766
closed
[osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S
0%
Description
Observation
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Jobs on OSD are in Queue for 6 days:
Acceptance Criteria
- AC1: Jobs on OSD for i915 get scheduled within a reasonable time (<days)
Suggestions
- DONE Investigate queued up jobs in All Tests: find the correspondingly scheduled jobs on https://openqa.suse.de/tests by putting "i915" into the "search" field for scheduled jobs
- Look into the i915 specific problems https://openqa.suse.de/admin/workers/3999
- Respond back in https://suse.slack.com/archives/C02CANHLANP/p1746433919139079
- Check the job which was scheduled for 6 days: https://openqa.suse.de/tests/17479434 (it got cancelled)
- Check the journal on grenache-1 (journalctl -u openqa-worker-auto-restart@11):
May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:45 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
…
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999
…
May 05 06:17:06 grenache-1 worker[437757]: [warn] [pid:437757] Websocket connection to http://openqa.suse.de/api/v1/ws/3999 finished by remote side with code 1006, no rea>
May 05 06:17:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
May 05 06:17:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:19:46 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
May 05 06:20:13 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:33 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
Updated by gpathak 11 days ago
- Copied from action #179816: [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S added
Updated by gpathak 11 days ago · Edited
I looked into the settings of one job that has been scheduled for 6 days: https://openqa.suse.de/tests/17479434
Its WORKER_CLASS is 64bit-i915, and there is only one worker with that worker class: https://openqa.suse.de/admin/workers/3999
The state of the worker is also a bit odd: its status says Working, but the details page shows that the latest job was cancelled 11 days ago.
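A worker class served by a single worker is a single point of failure for every job that requests it. The following is a minimal sketch of how such classes could be spotted, assuming the worker inventory is already available as a plain mapping of worker ID to worker classes. The function name and the inventory are hypothetical; only worker 3999 and the class 64bit-i915 come from this ticket, and the sketch does not query the openQA API.

```python
from collections import defaultdict


def classes_with_single_worker(workers):
    """Return {worker_class: worker_id} for classes served by exactly one worker."""
    by_class = defaultdict(set)
    for worker_id, worker_classes in workers.items():
        for worker_class in worker_classes:
            by_class[worker_class].add(worker_id)
    return {wc: next(iter(ids)) for wc, ids in by_class.items() if len(ids) == 1}


# Hypothetical inventory: only worker 3999 with 64bit-i915 is taken from
# this ticket, the other entries are made up for illustration.
workers = {
    3999: ["64bit-i915"],
    4001: ["qemu_x86_64"],
    4002: ["qemu_x86_64"],
}

print(classes_with_single_worker(workers))
```

Any class reported this way stalls completely as soon as its one worker is stuck, which matches what was observed for the i915 jobs.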
Updated by okurz 9 days ago
The worker instance running on grenache-1 is currently "unavailable" because the load on grenache-1 has been too high since yesterday. See https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&from=2025-05-01T16:35:24.553Z&to=2025-05-06T10:28:48.369Z&timezone=browser&var-datasource=000000001&viewPanel=panel-54694
Looking into the journal on grenache-1 with journalctl -u openqa-worker-auto-restart@11 I found:
May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:45 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:55 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:55 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:55 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:05 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:06 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:06 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:16 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:16 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:26 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:26 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999
…
May 05 06:17:06 grenache-1 worker[437757]: [warn] [pid:437757] Websocket connection to http://openqa.suse.de/api/v1/ws/3999 finished by remote side with code 1006, no rea>
May 05 06:17:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
May 05 06:17:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:19:46 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
May 05 06:20:13 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:33 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
So apparently the worker instance fails to register, with either "No route to host" or "Connection refused".
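To see at a glance which failure modes dominate in such a journal excerpt, the lines can be grouped by message type. This is a rough sketch, assuming the message wording matches the excerpt above; the patterns, event names and the function name are made up for illustration and are not part of openQA.

```python
import re
from collections import Counter

# Hypothetical event names, with patterns derived from the journal lines above.
PATTERNS = [
    ("ws_upgrade_failed", re.compile(r"Unable to upgrade to ws connection .*code (\d+)")),
    ("register_failed", re.compile(r"Failed to register at \S+ - connection error: (.+?) - trying again")),
    ("ws_closed", re.compile(r"Websocket connection to \S+ finished by remote side with code (\d+)")),
    ("connected", re.compile(r"Registered and connected via websockets")),
]


def summarize(lines):
    """Count journal lines per (event, detail); lines matching no pattern are skipped."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                # Patterns without a capture group carry no extra detail.
                detail = match.group(1) if match.groups() else ""
                counts[(name, detail)] += 1
                break
    return counts


sample = [
    "May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >",
    "May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999",
    "May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds",
    "May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds",
]

for (event, detail), count in summarize(sample).items():
    print(event, detail, count)
```

In practice the input could come from the journalctl call mentioned above, piped in line by line.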
Updated by okurz 8 days ago
- Subject changed from [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold, i915 in particular to [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by dheidler 2 days ago
- Priority changed from High to Normal
# systemctl disable --now openqa-worker@1
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
# systemctl disable --now openqa-worker@2
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
# systemctl disable --now openqa-worker@3
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
With that the system load has dropped enough to leave some headroom, and i915 jobs are now being worked on.
Maybe we should move the worker slots for bare metal machines (or at least some of them) from grenache-1 to a dedicated host.
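The reasoning behind disabling slots can be illustrated with a simple load check. This is only a sketch: the threshold value is a made-up placeholder, using the 1-minute load average is an assumption, and the exact rule the openQA worker applies is not stated in this ticket.

```python
import os


def exceeds_load_threshold(load_avg, threshold):
    """Return True if the given load average is above the threshold."""
    return load_avg > threshold


# Hypothetical value for illustration only; the threshold actually
# configured on grenache-1 is not given in this ticket.
ASSUMED_THRESHOLD = 40.0

if __name__ == "__main__":
    try:
        # os.getloadavg() returns the 1, 5 and 15 minute load averages.
        one_minute, _, _ = os.getloadavg()
    except OSError:
        print("load average not available on this system")
    else:
        too_high = exceeds_load_threshold(one_minute, ASSUMED_THRESHOLD)
        print(f"load {one_minute:.2f}, above threshold: {too_high}")
```

As long as a check like this keeps reporting a load above the threshold, a slot on the same host stays unavailable regardless of how many jobs are waiting for it, which is why removing other slots from the host helped.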
Updated by openqa_review 2 days ago
- Due date set to 2025-05-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 1 day ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/1022 to move worker instances
Updated by okurz 1 day ago
- Copied to action #182303: openQA worker instances blocked by jobs reported as "running" but according openQA jobs are already cancelled/obsoleted for long added
Updated by dheidler 1 day ago
- Copied to action #182312: Prevent starvation of bare metal worker slots (e.g. on worker36) by regular x86_64 slots taking all the system load added