action #137603

closed

[alert] Queue: State (SUSE) - too few jobs executed alert size:S

Added by jbaier_cz 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-10-09
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Queue: State (SUSE) - too few jobs executed alert

View alert
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/ad8b5de6-d5ca-43e0-b734-6289739bc2d8/view?orgId=1

Summary
Too few openQA jobs are executed
Description
Not enough openQA jobs are assigned to workers and executed while many
scheduled jobs exist in the scheduled state.

see https://progress.opensuse.org/issues/135122 for details

Values

E0=17 E1=273

Labels
alertname Queue: State (SUSE) - too few jobs executed alert
grafana_folder Salt

Silence
http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DQueue%3A+State+%28SUSE%29+-+too+few+jobs+executed+alert&matcher=grafana_folder%3DSalt&orgId=1

View dashboard http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1

View panel
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1&viewPanel=9

Observed 38s before this notification was delivered, at 2023-10-09
04:30:00 +0200 CEST

Might be related to #137600, or might just be normal behavior over the weekend. The alert is currently in a resolved state.

Suggestions

  • "too few jobs" means < 20 running, > 100 scheduled - are these numbers sensible? Reduce/increase?
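For discussion, the condition in the suggestion above can be sketched in Python. The thresholds 20 and 100 come from the bullet; the function name is hypothetical, and reading the alert values as E0 = running jobs and E1 = scheduled jobs is an assumption (E1=273 does match the scheduler's "Scheduled jobs: 273"), not something taken from the Grafana alert definition.

```python
def too_few_jobs_executed(running: int, scheduled: int,
                          min_running: int = 20,
                          max_scheduled: int = 100) -> bool:
    """Hypothetical helper: True when few jobs run while many are scheduled."""
    return running < min_running and scheduled > max_scheduled


# Assumption: E0 is the number of running jobs, E1 the number of scheduled jobs.
alert_fires = too_few_jobs_executed(running=17, scheduled=273)
```

With the values from this notification the condition holds, so changing either threshold changes how often a weekend backlog like this one triggers the alert.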

Related issues 1 (0 open, 1 closed)

Related to openQA Tests - action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M (Resolved, mkittler, 2023-09-20)

Actions #1

Updated by tinita 7 months ago

from /var/log/openqa_scheduler around that time:

[2023-10-09T04:29:07.965461+02:00] [debug] [pid:1683] Scheduling: Free workers: 927/949; Scheduled jobs: 273
[2023-10-09T04:29:08.614353+02:00] [debug] [pid:1683] Skipping 273 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,svirt-vmware:22,qemu_ppc64le-large-mem,tap:8,openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,hmc_ppc64le_sap:2,nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)

And that simply goes on until now. We apparently just didn't have many jobs that actually found a matching worker.
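To see which worker classes account for the skipped jobs, the "Skipping" line above can be broken down with a short Python sketch. It assumes each entry has the form `<comma-separated worker classes>:<count>` and that class names never contain a colon followed by digits; the function name and the regular expressions are illustrative and not part of openQA.

```python
import re
from collections import Counter

LOG_LINE = (
    "Skipping 273 jobs because of no free workers for requested worker classes "
    "(qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,svirt-vmware:22,"
    "qemu_ppc64le-large-mem,tap:8,openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,"
    "hmc_ppc64le_sap:2,nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,"
    "qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)"
)


def parse_skipped_jobs(line: str) -> Counter:
    """Map each requested worker-class combination to its skipped-job count."""
    inner = re.search(r"worker classes \((.*)\)", line).group(1)
    counts = Counter()
    # An entry ends at ":<digits>" followed by a comma or the end of the list.
    for classes, number in re.findall(r"(.+?):(\d+)(?:,|$)", inner):
        counts[classes] += int(number)
    return counts


counts = parse_skipped_jobs(LOG_LINE)
total = sum(counts.values())
ppc64le = sum(n for classes, n in counts.items() if "ppc64le" in classes)
```

For this log line the per-class counts add up to the 273 skipped jobs, and 244 of them request a worker class containing `ppc64le`.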

Actions #2

Updated by kraih 7 months ago

I just made the same observation while manually reviewing monitoring data. The system has actually been very responsive over the weekend and picked up new jobs almost immediately. This is exactly how it should be.

Actions #3

Updated by kraih 7 months ago

  • Status changed from New to Resolved
  • Assignee set to kraih

Actions #4

Updated by okurz 7 months ago

  • Status changed from Resolved to New

Then we should ensure that there is no alert in such cases, right?

Actions #5

Updated by kraih 7 months ago

okurz wrote in #note-4:

Then we should ensure that there is no alert in such cases, right?

The real problem is the 273 stuck scheduled jobs, which seem to be mostly ppc64le variations with no available workers. So the alert itself is valid, but there are other tickets about the problem already.

Actions #6

Updated by kraih 7 months ago

  • Assignee deleted (kraih)

Putting the ticket back into the backlog for estimation then, since the scope is too vague at the moment.

Actions #7

Updated by livdywan 7 months ago

  • Subject changed from [alert] Queue: State (SUSE) - too few jobs executed alert to [alert] Queue: State (SUSE) - too few jobs executed alert size:S
  • Description updated (diff)
  • Status changed from New to Workable

Actions #8

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Actions #9

Updated by okurz 7 months ago

  • Related to action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added

Actions #10

Updated by okurz 7 months ago

The actual alert is due to #136130

Actions #12

Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved

MR merged. No alert right now.
