action #137603: [alert] Queue: State (SUSE) - too few jobs executed alert size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #137603

closed

[alert] Queue: State (SUSE) - too few jobs executed alert size:S

Added by jbaier_cz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-10-09
Due date:
% Done:
Estimated time:
Tags: infra

Description

Observation

Queue: State (SUSE) - too few jobs executed alert

View alert
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/ad8b5de6-d5ca-43e0-b734-6289739bc2d8/view?orgId=1

Summary
Too few openQA jobs are executed
Description
Not enough openQA jobs are assigned to workers and executed while many
scheduled jobs exist in the scheduled state.

see https://progress.opensuse.org/issues/135122 for details

Values

E0=17 E1=273

Labels
alertname Queue: State (SUSE) - too few jobs executed alert
grafana_folder Salt

Silence
http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DQueue%3A+State+%28SUSE%29+-+too+few+jobs+executed+alert&matcher=grafana_folder%3DSalt&orgId=1

View dashboard http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1

View panel
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1&viewPanel=9

Observed 38s before this notification was delivered, at 2023-10-09 04:30:00 +0200 CEST

Might be related to #137600, or might just be normal behavior over the weekend. The alert is currently in a resolved state.

Suggestions

"too few jobs" means < 20 running, > 100 scheduled - are these numbers sensible? Reduce/increase?

Related issues 1 (0 open — 1 closed)

Updated by tinita over 1 year ago

from /var/log/openqa_scheduler around that time:

[2023-10-09T04:29:07.965461+02:00] [debug] [pid:1683] Scheduling: Free workers: 927/949; Scheduled jobs: 273
[2023-10-09T04:29:08.614353+02:00] [debug] [pid:1683] Skipping 273 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,svirt-vmware:22,qemu_ppc64le-large-mem,tap:8,openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,hmc_ppc64le_sap:2,nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)

And that simply continues until now. We apparently just didn't have very many jobs that actually found a matching worker.
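
To see which worker classes account for the skipped jobs, the "Skipping" line can be split into per-class counts. The following is a rough sketch that assumes each entry has the form `class[,class...]:count`, as in the log line quoted above; the helper name `skipped_jobs_by_class` is hypothetical and not part of openQA.

```python
import re
from collections import Counter

# The scheduler log line quoted above, split only for readability.
SKIP_LINE = (
    "[2023-10-09T04:29:08.614353+02:00] [debug] [pid:1683] Skipping 273 jobs "
    "because of no free workers for requested worker classes "
    "(qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,"
    "svirt-vmware:22,qemu_ppc64le-large-mem,tap:8,"
    "openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,hmc_ppc64le_sap:2,"
    "nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,"
    "qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)"
)


def skipped_jobs_by_class(line: str) -> Counter:
    """Map each requested worker class combination to its skipped job count."""
    match = re.search(r"requested worker classes \((.*)\)", line)
    counts = Counter()
    if match is None:
        return counts
    # Assumed entry format: "class[,class...]:count", entries joined by commas.
    for classes, count in re.findall(r"([^:]+):(\d+)(?:,|$)", match.group(1)):
        counts[classes] += int(count)
    return counts


counts = skipped_jobs_by_class(SKIP_LINE)
total = sum(counts.values())
ppc64le = sum(n for classes, n in counts.items() if "ppc64le" in classes)
print(total, ppc64le)  # 273 244
```

For this line the per-class counts add up to the 273 skipped jobs, and 244 of them request a class combination containing `ppc64le`.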

Updated by kraih over 1 year ago

I just made the same observation while manually reviewing monitoring data. The system has actually been very responsive over the weekend and picked up new jobs almost immediately. This is exactly how it should be.

Updated by kraih over 1 year ago

Status changed from New to Resolved
Assignee set to kraih

Updated by okurz over 1 year ago

Status changed from Resolved to New

Then we should ensure that there is no alert in such cases, right?

Updated by kraih over 1 year ago

okurz wrote in #note-4:

Then we should ensure that there is no alert in such cases, right?

The real problem is the 273 stuck scheduled jobs, which seem to be mostly ppc64le variations with no available workers. So the alert itself is valid, but there are other tickets about the problem already.

Updated by kraih over 1 year ago

Assignee deleted (~~kraih~~)

Putting the ticket back into the backlog for estimation then, since the scope is too vague at the moment.

Updated by livdywan over 1 year ago

Subject changed from [alert] Queue: State (SUSE) - too few jobs executed alert to [alert] Queue: State (SUSE) - too few jobs executed alert size:S
Description updated (diff)
Status changed from New to Workable