action #181766
closed
[osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S
Added by gpathak 11 days ago.
Updated 1 day ago.
Category: Regressions/Crashes
Description
Observation
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Jobs on OSD have been in the queue for 6 days.

Acceptance Criteria
- AC1: Jobs on OSD for i915 get scheduled within a reasonable time (<days)
Suggestions
May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:45 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
…
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999
…
May 05 06:17:06 grenache-1 worker[437757]: [warn] [pid:437757] Websocket connection to http://openqa.suse.de/api/v1/ws/3999 finished by remote side with code 1006, no rea>
May 05 06:17:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
May 05 06:17:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:19:46 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
May 05 06:20:13 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:33 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
- Copied from action #179816: [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S added
I looked into the settings of one job which has been scheduled for 6 days: https://openqa.suse.de/tests/17479434
The WORKER_CLASS mentioned is 64bit-i915, and there is only one worker with the 64bit-i915 worker class: https://openqa.suse.de/admin/workers/3999
The state of the worker is also a bit weird: its status says it is Working, but the details page shows that the latest job was cancelled 11 days ago.

- Subject changed from [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold to [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold, i915 in particular
- Description updated (diff)
Right now the worker instance running on grenache-1 is "unavailable" because the load on grenache-1 has been too high since yesterday. See https://monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?orgId=1&from=2025-05-01T16:35:24.553Z&to=2025-05-06T10:28:48.369Z&timezone=browser&var-datasource=000000001&viewPanel=panel-54694
Taking a look into the journal on grenache-1 I found from journalctl -u openqa-worker-auto-restart@11
May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:45 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:55 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:55 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:55 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:05 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:06 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:06 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:16 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:16 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:26 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:26 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999
…
May 05 06:17:06 grenache-1 worker[437757]: [warn] [pid:437757] Websocket connection to http://openqa.suse.de/api/v1/ws/3999 finished by remote side with code 1006, no rea>
May 05 06:17:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
May 05 06:17:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:19:46 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
May 05 06:20:13 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:33 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
So apparently the worker instance has problems registering, failing with either "No route to host" or "Connection refused".
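To get a quick overview of which registration errors dominate in a longer journal, the log lines can be reduced to a count per error reason. The following is a minimal sketch, assuming the message format shown in the excerpt above; the function name `summarize_register_errors` is hypothetical and not part of openQA.

```shell
#!/bin/sh
# Summarise worker registration failures from journal lines read on stdin.
# Prints "<count> <reason>" for each distinct connection error, most frequent first.
# Assumption: messages look like
#   "Failed to register at <host> - connection error: <reason> - trying again in ..."
summarize_register_errors() {
    # Keep only the <reason> part of matching lines, then count duplicates.
    sed -n 's/.*Failed to register at .* - connection error: \(.*\) - trying again.*/\1/p' |
        sort | uniq -c | sort -rn |
        # Strip the padding that "uniq -c" puts in front of the count.
        awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count, $0 }'
}

# Possible usage on the worker host (unit name taken from this ticket):
#   journalctl -u openqa-worker-auto-restart@11 --since "2025-05-05" | summarize_register_errors
```

Reading from stdin keeps the function independent of where the log comes from, so it works on a saved excerpt as well as on live `journalctl` output.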
- Subject changed from [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold, i915 in particular to [osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Priority changed from High to Normal
# systemctl disable --now openqa-worker@1
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
# systemctl disable --now openqa-worker@2
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
# systemctl disable --now openqa-worker@3
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
Now we have decreased the system load enough to have something left and i915 jobs are being worked on.
Maybe we should move the worker slots for bare metal machines (or at least some of them) from grenache-1 to a dedicated host.
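Before disabling or re-enabling slots it can help to check whether the host load is currently above the threshold at which workers stop accepting jobs. The following is a minimal sketch, assuming a Linux host with `/proc/loadavg`; the function name `load_exceeds_threshold` and the value 40 in the usage example are placeholders, since the threshold actually configured on grenache-1 is not stated in this ticket.

```shell
#!/bin/sh
# Return success (0) when the 1-minute load average exceeds the given threshold.
# $1: threshold (placeholder value, not the one configured on grenache-1)
# $2: optional loadavg line; defaults to the contents of /proc/loadavg
load_exceeds_threshold() {
    threshold=$1
    loadavg=${2:-$(cat /proc/loadavg)}
    # The first field of /proc/loadavg is the 1-minute average.
    # awk does the floating-point comparison that plain sh cannot do.
    echo "$loadavg" | awk -v t="$threshold" '{ exit !($1 > t) }'
}

# Possible usage:
#   if load_exceeds_threshold 40; then echo "load too high for bare metal slots"; fi
```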
Also, there seems to have been a stuck job.
The worker entry in the osd db was in state "working", but the referenced job was in state "canceled".
And the worker itself was not working but waiting for jobs.
openqa-cli api --osd -X DELETE jobs/17335314
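The mismatch described above (worker entry says "working", referenced job is already finished) can be expressed as a small check. The following is a minimal sketch; the function name `is_stale_assignment` is hypothetical, and the state strings are assumptions that mirror the wording of this ticket rather than verified openQA API values. How the two values are obtained from the webUI or the database is left to the caller.

```shell
#!/bin/sh
# Return success (0) when a worker claims to be working on a job that is already finished.
# $1: worker status as reported for the worker entry (assumed label: "working")
# $2: state of the job the worker entry references
# Assumption: the finished states listed below follow the wording used in this ticket.
is_stale_assignment() {
    worker_status=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    job_state=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    # A worker that is not marked as working cannot hold a stale job.
    [ "$worker_status" = "working" ] || return 1
    case "$job_state" in
        canceled|cancelled|obsoleted|done) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage with the values observed in this ticket:
#   is_stale_assignment "Working" "canceled" && echo "worker 3999 references a finished job"
```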
Since we recently managed to increase the number of worker instances that can be connected to one webUI instance, we can now move instances to another machine.
- Due date set to 2025-05-27
Setting due date based on mean cycle time of SUSE QE Tools
Reenabled lpar worker slots 1-3 on grenache-1
- Status changed from In Progress to Feedback
- Copied to action #182303: openQA worker instances blocked by jobs reported as "running" but according openQA jobs are already cancelled/obsoleted for long added
- Status changed from Feedback to Resolved
- Copied to action #182312: Prevent starvation of bare metal worker slots (e.g. on worker36) by regular x86_64 slots taking all the system load added