action #135644

open

Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version: QA (public, currently private due to #173521) - future
Start date: 2023-09-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/admin/workers/885 shows that the last job ran in the early morning of 2023-09-13, although many qemu-ppc64le jobs are currently scheduled.

None of the services is unresponsive: the worker slots, the scheduler and the websocket server all look good. The source of the problem appears to be the starvation prevention, which does not work well when only very few worker slots are available for a certain worker class.

We currently have only 4 idle/free worker slots that are capable of running jobs with worker class qemu_ppc64le. The scheduler logs the following messages about them:

martchus@openqa:~> tail -f /var/log/openqa_scheduler | grep -i 'holding worker'
…
[2023-09-13T10:46:09.216709+02:00] [debug] [pid:1553] Holding worker 887 for job 12082558 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216741+02:00] [debug] [pid:1553] Holding worker 885 for job 12082559 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216829+02:00] [debug] [pid:1553] Holding worker 914 for job 12082561 to avoid starvation (cluster B)
[2023-09-13T10:46:09.216929+02:00] [debug] [pid:1553] Holding worker 898 for job 12095650 to avoid starvation (cluster A)

I appended the (cluster …) annotations in parentheses to show which jobs belong to the same cluster. The output looks the same on subsequent scheduler ticks.
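To make it easier to compare held slots across scheduler ticks, the log lines above can be grouped per cluster. The following is a minimal sketch, not part of openQA: the function name `held_workers_by_cluster` is hypothetical, and the optional `(cluster X)` suffix it parses is the manual annotation from this ticket, not something the scheduler prints itself.

```python
import re
from collections import defaultdict

# Matches the scheduler debug message quoted in this ticket. The trailing
# "(cluster X)" part is the manual annotation added here, so it is optional.
HOLD_RE = re.compile(
    r"Holding worker (?P<worker>\d+) for job (?P<job>\d+) to avoid starvation"
    r"(?: \(cluster (?P<cluster>\w+)\))?"
)


def held_workers_by_cluster(lines):
    """Group (worker id, job id) pairs by the annotated cluster label.

    Lines without an annotation are collected under "unlabelled".
    """
    clusters = defaultdict(list)
    for line in lines:
        match = HOLD_RE.search(line)
        if match is None:
            continue  # not a "Holding worker" message
        label = match.group("cluster") or "unlabelled"
        clusters[label].append((int(match.group("worker")), int(match.group("job"))))
    return dict(clusters)


# The four annotated lines from the log excerpt in this ticket.
SAMPLE = [
    "[2023-09-13T10:46:09.216709+02:00] [debug] [pid:1553] Holding worker 887 for job 12082558 to avoid starvation (cluster A)",
    "[2023-09-13T10:46:09.216741+02:00] [debug] [pid:1553] Holding worker 885 for job 12082559 to avoid starvation (cluster A)",
    "[2023-09-13T10:46:09.216829+02:00] [debug] [pid:1553] Holding worker 914 for job 12082561 to avoid starvation (cluster B)",
    "[2023-09-13T10:46:09.216929+02:00] [debug] [pid:1553] Holding worker 898 for job 12095650 to avoid starvation (cluster A)",
]

if __name__ == "__main__":
    for label, pairs in sorted(held_workers_by_cluster(SAMPLE).items()):
        workers = ", ".join(str(worker) for worker, _ in pairs)
        print(f"cluster {label}: {len(pairs)} held slot(s), workers {workers}")
```

Run against the excerpt, this groups three held slots under cluster A and one under cluster B, which is the distribution of the four free qemu_ppc64le slots described above.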

The cluster that I called "cluster A" consists only of the three jobs mentioned in the log lines I've pasted. None of these jobs is blocked by other dependencies, so normally one would expect the cluster to be scheduled using the workers that were held back. Maybe we have an off-by-one error here? Maybe the fact that the only other free worker slot is held back for another cluster is problematic?


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #135578: Long job age and jobs not executed for long size:M - Resolved - nicksinger
