action #181766

closed

[osd][alert] i915 worker instance not working on jobs since weeks (was: "Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold") size:S

Added by gpathak 11 days ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date: 2025-05-27
% Done: 0%
Estimated time:

Description

Observation

Jobs not scheduled for 4 days (345600s). Possible reasons:

  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value.

Jobs on OSD have been in the queue for 6 days.

Acceptance Criteria

  • AC1: Jobs on OSD for i915 get scheduled within a reasonable time (<days)

Suggestions

May 03 05:31:35 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:31:45 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:31:45 grenache-1 worker[437757]: [warn] [pid:437757] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3999, code 502 - trying again in 10 >
…
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 03 05:32:36 grenache-1 worker[437757]: [info] [pid:437757] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3999
May 03 05:32:42 grenache-1 worker[437757]: [info] [pid:437757] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3999
…
May 05 06:17:06 grenache-1 worker[437757]: [warn] [pid:437757] Websocket connection to http://openqa.suse.de/api/v1/ws/3999 finished by remote side with code 1006, no rea>
May 05 06:17:16 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
May 05 06:17:26 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:19:46 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:03 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
May 05 06:20:13 grenache-1 worker[437757]: [info] [pid:437757] Registering with openQA openqa.suse.de
May 05 06:20:33 grenache-1 worker[437757]: [warn] [pid:437757] Failed to register at openqa.suse.de - connection error: No route to host - trying again in 10 seconds
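
To get a quick overview of why the worker kept losing its connection, the warnings in a journal excerpt like the one above could be tallied by cause. The following is only a minimal sketch in Python: the function names and category labels are hypothetical and not part of openQA, and the pattern assumes the exact line layout shown in the excerpt (including lines cut off with `>`).

```python
import re
from collections import Counter

# Matches journal lines shaped like the excerpt above, e.g.
# "May 05 06:17:16 grenache-1 worker[437757]: [warn] [pid:437757] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"worker\[(?P<pid>\d+)\]: \[(?P<level>\w+)\] \[pid:\d+\] (?P<msg>.*)$"
)


def classify(msg):
    """Map a worker warning to a coarse category (labels are illustrative)."""
    if "Unable to upgrade to ws connection" in msg:
        return "ws_upgrade_failed"
    if "finished by remote side" in msg:
        return "ws_closed_by_remote"
    if "Failed to register" in msg:
        # Extract the reason, e.g. "Connection refused" or "No route to host"
        m = re.search(r"connection error: (.+?) - trying again", msg)
        return "register_failed: " + (m.group(1) if m else "unknown")
    return None


def summarize_warnings(lines):
    """Count [warn] lines per category; other levels and unknown lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "warn":
            continue
        category = classify(m.group("msg"))
        if category:
            counts[category] += 1
    return counts
```

Calling `summarize_warnings()` on the lines of a saved journal excerpt returns a `Counter` keyed by cause, which makes it easier to see whether failures were dominated by websocket upgrade errors or by the web UI host being unreachable.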

Related issues 3 (2 open, 1 closed)

Copied from openQA Infrastructure (public) - action #179816: [osd][alert] Job age (scheduled) (max) alert possibly due to workers exceeding the configured load threshold size:S (Resolved, livdywan, 2025-04-02)

Copied to openQA Project (public) - action #182303: openQA worker instances blocked by jobs reported as "running" but according openQA jobs are already cancelled/obsoleted for long (Feedback, mkittler, 2025-05-13, 2025-05-28)

Copied to openQA Infrastructure (public) - action #182312: Prevent starvation of bare metal worker slots (e.g. on worker36) by regular x86_64 slots taking all the system load (New, 2025-05-13, 2025-05-27)
