action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come - openQA Project (public) - openSUSE Project Management Tool

action #135644

open

Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Low
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2023-09-13
Due date:
% Done:
Estimated time:
Tags: reactive work

Description

Observation

https://openqa.suse.de/admin/workers/885 shows that the last job ran in the early morning of 2023-09-13, even though many qemu-ppc64le jobs are currently scheduled.

The problem is not an unresponsive service: the worker slots, the scheduler and the websocket server all look good. The source of the problem appears to be the starvation prevention, which does not work well when only very few worker slots are available for a certain worker class.

We currently have only 4 idle/free worker slots that are capable of running jobs with worker class qemu_ppc64le. The scheduler logs the following messages about them:

martchus@openqa:~> tail -f /var/log/openqa_scheduler | grep -i 'holding worker'
…
[2023-09-13T10:46:09.216709+02:00] [debug] [pid:1553] Holding worker 887 for job 12082558 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216741+02:00] [debug] [pid:1553] Holding worker 885 for job 12082559 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216829+02:00] [debug] [pid:1553] Holding worker 914 for job 12082561 to avoid starvation (cluster B)
[2023-09-13T10:46:09.216929+02:00] [debug] [pid:1553] Holding worker 898 for job 12095650 to avoid starvation (cluster A)

I appended the (cluster …) annotations to show which jobs belong to the same cluster. The output looks the same on subsequent scheduler ticks.
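
To repeat this grouping on other scheduler ticks, the held-back workers can be extracted from the log mechanically. The following is a minimal Python sketch, assuming the message format is exactly the one shown in the lines pasted above. The name `parse_held_workers`, the regular expression, and the idea of grouping by the timestamp truncated to seconds are illustrative choices, not part of openQA. The hand-added (cluster …) annotation is simply ignored.

```python
import re
from collections import defaultdict

# Pattern derived from the log lines quoted in this ticket (assumption: the
# scheduler's message format does not vary beyond what is shown there).
HOLD_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[debug\] \[pid:\d+\] "
    r"Holding worker (?P<worker>\d+) for job (?P<job>\d+) to avoid starvation"
)

# The four log lines from the ticket description, including the manual annotations.
SAMPLE = """\
[2023-09-13T10:46:09.216709+02:00] [debug] [pid:1553] Holding worker 887 for job 12082558 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216741+02:00] [debug] [pid:1553] Holding worker 885 for job 12082559 to avoid starvation (cluster A)
[2023-09-13T10:46:09.216829+02:00] [debug] [pid:1553] Holding worker 914 for job 12082561 to avoid starvation (cluster B)
[2023-09-13T10:46:09.216929+02:00] [debug] [pid:1553] Holding worker 898 for job 12095650 to avoid starvation (cluster A)
"""


def parse_held_workers(lines):
    """Return {timestamp truncated to seconds: {job ID: worker ID}}."""
    ticks = defaultdict(dict)
    for line in lines:
        match = HOLD_RE.search(line)
        if match is None:
            continue  # not a "Holding worker" message
        # Drop the fractional seconds so messages of one tick share a key
        # (assumption: a tick logs all of its messages within one second).
        tick = match.group("ts").split(".")[0]
        ticks[tick][int(match.group("job"))] = int(match.group("worker"))
    return dict(ticks)


if __name__ == "__main__":
    for tick, held in parse_held_workers(SAMPLE.splitlines()).items():
        print(tick, held)
```

Comparing the mapping of consecutive ticks shows whether the same workers stay reserved for the same jobs over time.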

The cluster that I called "cluster A" consists only of the three jobs mentioned in the log lines above. None of these jobs is blocked by other dependencies, so normally one would expect the cluster to be scheduled using the workers that were held back. Maybe we have an off-by-one error here? Or maybe it is a problem that the only other free worker slot is held back for another cluster?
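
The expectation can be written down as a small check: a cluster is ready once every one of its jobs has a worker held for it. This is a Python sketch under assumptions, not the scheduler's actual logic. `ready_clusters` is a hypothetical helper, and the second member of cluster B is a placeholder because only one job of that cluster appears in the log excerpt.

```python
def ready_clusters(clusters, held):
    """Return the names of clusters whose jobs all have a worker held for them.

    clusters: {cluster name: set of job IDs}
    held:     {job ID: worker ID}, as reported by the "Holding worker" lines
    """
    return sorted(
        name for name, jobs in clusters.items() if jobs.issubset(held)
    )


# Job and worker IDs taken from the log lines in the ticket description.
held = {12082558: 887, 12082559: 885, 12082561: 914, 12095650: 898}

clusters = {
    "A": {12082558, 12082559, 12095650},
    # Placeholder: the other job(s) of cluster B are not named in the ticket.
    "B": {12082561, "hypothetical-second-job"},
}

if __name__ == "__main__":
    print(ready_clusters(clusters, held))
```

With this data only cluster A is reported as ready, which is why the missing assignment looks like a bug rather than expected starvation prevention.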

Related issues 1 (0 open — 1 closed)

Updated by okurz over 1 year ago

Copied from action #135578: Long job age and jobs not executed for long size:M added

Updated by mkittler over 1 year ago

Description updated (diff)

I checked what's going on and extended the observation.

Updated by openqa_review over 1 year ago

Due date set to 2023-09-28

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz over 1 year ago

Due date deleted (~~2023-09-28~~)
Status changed from In Progress to Resolved

More recent jobs ran again on malbec, e.g. https://openqa.suse.de/tests/12047996

mkittler wrote:

We currently only have 4 worker slots that are capable of running jobs with worker class qemu_ppc64le.

That is not correct. There is also powerqaworker-qam-1.

@mkittler thanks for the investigation and explanation. I have added an idea in #65271 to also forward the information that is present in the log line to the worker status in the webUI.

Updated by mkittler over 1 year ago

Description updated (diff)

That is not correct. There is also powerqaworker-qam-1.

I was filtering for idle workers, so at the time only those 4 slots were idle. I have corrected the ticket description.

@mkittler thanks for the investigation and explanation. I have added an idea in #65271 to also forward the information that is present in the log line to the worker status in the webUI.

I only see the suggestion

worker status: Add information about "hold back" workers in the webUI, there is already a log line. See #135644 for details

on that page. But I think this issue is not just about displaying the status in the web UI. There is a real problem here that we should fix, because jobs that should have been assigned were not assigned (e.g. there might be an off-by-one error in the scheduler code for holding back workers). The only thing I haven't checked is whether we might have simply run into the total limit for running jobs. (But if it was just that, then there is likely nothing to improve at all.)
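
The unchecked hypothesis could be separated from a scheduler bug with a check along the following lines. This is a minimal Python sketch, assuming the limit and the number of running jobs have been read from the instance by some other means. `limit_explains_stall` and its parameters are hypothetical names, and both the treatment of a negative limit as "unlimited" and the idea that a cluster must fit under the limit as a whole are assumptions of this sketch.

```python
def limit_explains_stall(running_jobs, max_running_jobs, cluster_size):
    """Return True if starting the whole cluster would exceed the job limit."""
    if max_running_jobs < 0:  # assumption: a negative value means "no limit"
        return False
    return running_jobs + cluster_size > max_running_jobs


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements from the instance.
    print(limit_explains_stall(running_jobs=298, max_running_jobs=300, cluster_size=3))
```

If such a check comes out false while a fully covered cluster still stays unassigned, the limit can be ruled out as the explanation.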

Updated by okurz over 1 year ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Subject changed from Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 to Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come
Category set to Feature requests
Status changed from Resolved to New
Assignee deleted (~~okurz~~)
Priority changed from High to Low
Target version changed from Ready to future

OK, the problem is actually still reproducible, so we should keep this ticket as a reference until the issue is fixed for good.

Updated by okurz over 1 year ago

Parent task deleted (~~#135122~~)
