action #137603
closed
[alert] Queue: State (SUSE) - too few jobs executed alert size:S
0%
Description
Observation
Queue: State (SUSE) - too few jobs executed alert
Summary
Too few openQA jobs are executed
Description
Not enough openQA jobs are assigned to workers and executed while many
jobs exist in the scheduled state. See https://progress.opensuse.org/issues/135122 for details.
Values
E0=17 E1=273
Labels
alertname Queue: State (SUSE) - too few jobs executed alert
grafana_folder Salt
View dashboard
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1
View panel
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz?orgId=1&viewPanel=9
Observed 38s before this notification was delivered, at 2023-10-09 04:30:00 +0200 CEST
Might be related to #137600, or might just be normal behavior over the weekend. The alert is currently in a resolved state.
Suggestions
- "too few jobs" means < 20 running, > 100 scheduled - are these numbers sensible? Reduce/increase?
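For reference, the condition quoted above can be sketched roughly as follows. This is only an illustration, not the actual Grafana alert rule: the function and parameter names are hypothetical, and reading E0 as the number of running jobs and E1 as the number of scheduled jobs is an assumption based on the values in the notification.

```python
def too_few_jobs_executed(running: int, scheduled: int,
                          min_running: int = 20,
                          max_scheduled: int = 100) -> bool:
    """Return True when fewer than min_running jobs are running
    while more than max_scheduled jobs are waiting in the scheduled state.

    The defaults (20 and 100) are the thresholds named in the suggestion;
    the names themselves are hypothetical.
    """
    return running < min_running and scheduled > max_scheduled


# Values from the notification, assuming E0 = running and E1 = scheduled.
alert_fires = too_few_jobs_executed(running=17, scheduled=273)
```

With the values from the notification (17 and 273) both parts of the condition hold, so the sketch reports the alert as firing. Raising or lowering either threshold only changes the two default arguments.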
Updated by tinita about 1 year ago
from /var/log/openqa_scheduler around that time:
[2023-10-09T04:29:07.965461+02:00] [debug] [pid:1683] Scheduling: Free workers: 927/949; Scheduled jobs: 273
[2023-10-09T04:29:08.614353+02:00] [debug] [pid:1683] Skipping 273 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,svirt-vmware:22,qemu_ppc64le-large-mem,tap:8,openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,hmc_ppc64le_sap:2,nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)
And that simply continues until now. Apparently we just didn't have very many jobs that actually found a matching worker.
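To see which worker classes account for the skipped jobs, the "Skipping" line can be broken down with a small script. This is a rough sketch, not an openQA tool: the function name is hypothetical, the line format is assumed from the single log line quoted above, and matching on the substring "ppc64le" is only a crude heuristic.

```python
import re

# The "Skipping" line from the scheduler log quoted above.
SKIP_LINE = (
    "[2023-10-09T04:29:08.614353+02:00] [debug] [pid:1683] Skipping 273 jobs "
    "because of no free workers for requested worker classes "
    "(qemu_ppc64le,tap:107,spvm_ppc64le:86,hmc_ppc64le-1disk:36,svirt-vmware:22,"
    "qemu_ppc64le-large-mem,tap:8,openqaworker14,qemu_x86_64,tap:3,hmc_ppc64le:2,"
    "hmc_ppc64le_sap:2,nue,qemu_x86_64,tap:2,prg_office,qemu_ppc64le,qemu_x86_64:2,"
    "qemu_x86_64_no_tmpfs:2,hmc_ppc64le-4disk:1)"
)


def parse_skipped_classes(line: str) -> dict[str, int]:
    """Map each worker class combination to its number of skipped jobs.

    A combination may itself contain commas (e.g. "qemu_ppc64le,tap"),
    so entries are delimited by the ":<count>" suffix, not by commas.
    """
    match = re.search(r"worker classes \((.*)\)", line)
    if match is None:
        return {}
    pairs = re.findall(r"([^:]+):(\d+)(?:,|$)", match.group(1))
    return {name: int(count) for name, count in pairs}


classes = parse_skipped_classes(SKIP_LINE)
total = sum(classes.values())
ppc64le = sum(n for name, n in classes.items() if "ppc64le" in name)
print(f"{ppc64le} of {total} skipped jobs request a ppc64le worker class")
```

For the line above, the per-class counts add up to the 273 skipped jobs, and 244 of them request a class combination containing "ppc64le".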
Updated by kraih about 1 year ago
I just made the same observation while manually reviewing monitoring data. The system has actually been very responsive over the weekend and picked up new jobs almost immediately. This is exactly how it should be.
Updated by kraih about 1 year ago
- Status changed from New to Resolved
- Assignee set to kraih
Updated by okurz about 1 year ago
- Status changed from Resolved to New
Then we should ensure that there is no alert in such cases, right?
Updated by kraih about 1 year ago
okurz wrote in #note-4:
Then we should ensure that there is no alert in such cases, right?
The real problem is the 273 stuck scheduled jobs, which seem to be mostly ppc64le
variations with no available workers. So the alert itself is valid, but there are other tickets about the problem already.
Updated by kraih about 1 year ago
- Assignee deleted (kraih)
Putting the ticket back into the backlog for estimation then, since the scope is too vague at the moment.
Updated by livdywan about 1 year ago
- Subject changed from [alert] Queue: State (SUSE) - too few jobs executed alert to [alert] Queue: State (SUSE) - too few jobs executed alert size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 year ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz about 1 year ago
- Related to action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Updated by okurz about 1 year ago
- Status changed from In Progress to Resolved
MR merged. No alert right now.