action #135578: Long job age and jobs not executed for long size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

#1

Updated by okurz over 1 year ago

Copied from action #135380: A significant number of scheduled jobs with one or two running triggers an alert added

#2

Updated by tinita over 1 year ago

I just came up with a query to get the worker classes of the currently waiting jobs:

openqa=> select waiting.wc, count(*)
         from (
           select string_agg(js.value, ',' order by js.value) as wc
           from jobs j
           join job_settings js on j.id = js.job_id
           where j.state = 'scheduled'
             and js.key = 'WORKER_CLASS'
             and j.t_created < '2023-09-12 10:00:00'
           group by j.id
         ) as waiting
         group by waiting.wc
         order by count(*) desc
         limit 20;
                 wc                 | count 
------------------------------------+-------
 qemu_x86_64,tap                    |  3283
 qemu_ppc64le                       |  2389
 s390-kvm-sle12                     |   545
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   164
 qemu_x86_64-large-mem,tap          |    64
 64bit-ipmi-nvdimm                  |    57
 qemu_x86_64,tap,worker37           |    46
 qemu_x86_64,tap,tap                |    46
 qemu_x86_64,tap,worker29           |    45
 hmc_ppc64le-1disk                  |    40
 qemu_x86_64,tap,worker40           |    36
 qemu_x86_64,tap,worker39           |    33
 qemu_x86_64,qemu_x86_64,tap,tap    |    22
 qemu_x86_64-large-mem,tap,worker39 |    21
 qemu_x86_64,tap,worker30           |    18
 qemu_x86_64-large-mem,tap,worker38 |    12
 qemu_x86_64-large-mem,tap,worker40 |    12
 qemu_x86_64-large-mem,tap,worker30 |    12
 qemu_x86_64,tap,worker31           |    11
(20 rows)

Looking only at jobs older than 3 days, the picture is even clearer:

openqa=> select waiting.wc, count(*)
         from (
           select string_agg(js.value, ',' order by js.value) as wc
           from jobs j
           join job_settings js on j.id = js.job_id
           where j.state = 'scheduled'
             and js.key = 'WORKER_CLASS'
             and j.t_created < '2023-09-09 10:00:00'
           group by j.id
         ) as waiting
         group by waiting.wc
         order by count(*) desc
         limit 10;
                 wc                 | count 
------------------------------------+-------
 qemu_ppc64le                       |  2162
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   162
 qemu_x86_64,tap                    |   112
 hmc_ppc64le-1disk                  |    40
 64bit-ipmi-nvdimm                  |     9
 hmc_ppc64le-4disk                  |     5
 qemu_x86_64-large-mem,tap,worker36 |     4
 qemu_x86_64-large-mem,tap,worker35 |     4
 qemu_x86_64,tap,worker31           |     3
(10 rows)

Edit: I additionally sorted the string_agg input so that the same combination of worker classes does not show up as multiple entries.
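
To illustrate what the sorting achieves and what it does not, here is a minimal Python sketch. The sample job list and the helper name `normalize_worker_classes` are made up for illustration and are not part of openQA. Sorting merges different orderings of the same classes, but a repeated value (as in the `qemu_x86_64,tap,tap` rows above) still produces its own key unless the values are also deduplicated. In PostgreSQL the deduplication would presumably correspond to using `distinct` inside the aggregate; that variant was not run here.

```python
from collections import Counter


def normalize_worker_classes(values, dedupe=True):
    """Return a canonical comma-separated key for one job's WORKER_CLASS values."""
    classes = set(values) if dedupe else values
    return ",".join(sorted(classes))


# Hypothetical sample: one list of WORKER_CLASS values per scheduled job.
jobs = [
    ["qemu_x86_64", "tap"],
    ["tap", "qemu_x86_64"],         # same combination, different order
    ["qemu_x86_64", "tap", "tap"],  # same combination, repeated value
]

# Sorting only: the repeated "tap" still yields a separate key.
sorted_only = Counter(normalize_worker_classes(j, dedupe=False) for j in jobs)

# Sorting plus deduplication: all three jobs share one key.
deduped = Counter(normalize_worker_classes(j) for j in jobs)

print(dict(sorted_only))
print(dict(deduped))
```

With sorting only, the sample yields two distinct keys; with deduplication all three jobs are counted under `qemu_x86_64,tap`.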

#3

Updated by tinita over 1 year ago

Looking at qemu_ppc64le specifically, I investigated how many of those jobs are investigation jobs:

openqa=> select count(*) as wc
         from jobs j
         join job_settings js on j.id = js.job_id
         where j.state = 'scheduled'
           and js.key = 'WORKER_CLASS'
           and js.value = 'qemu_ppc64le'
           and j.t_created < '2023-09-12 10:00:00'
           and test like '%:investigate:retry%';
 wc 
----
 80
(1 row)

openqa=> select count(*) as wc
         from jobs j
         join job_settings js on j.id = js.job_id
         where j.state = 'scheduled'
           and js.key = 'WORKER_CLASS'
           and js.value = 'qemu_ppc64le'
           and j.t_created < '2023-09-12 10:00:00'
           and test like '%:investigate:%';
 wc  
-----
 179
(1 row)

openqa=> select count(*) as wc
         from jobs j
         join job_settings js on j.id = js.job_id
         where j.state = 'scheduled'
           and js.key = 'WORKER_CLASS'
           and js.value = 'qemu_ppc64le'
           and j.t_created < '2023-09-12 10:00:00'
           and test not like '%:investigate:%';
  wc  
------
 2144
(1 row)
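
For a rough sense of proportion, the three counts above can be combined as in the following sketch. It assumes that every job has a non-null test name, so that the `like` and `not like` counts add up to the total; the variable names are only illustrative.

```python
# Counts copied from the three queries above
# (WORKER_CLASS = qemu_ppc64le, created before 2023-09-12 10:00:00).
investigate_retry = 80  # test like '%:investigate:retry%'
investigate_all = 179   # test like '%:investigate:%' (includes the retries)
regular = 2144          # test not like '%:investigate:%'

# Assumption: like + not like covers all matching jobs (no NULL test names).
total = investigate_all + regular
investigate_share = investigate_all / total
retry_share_of_investigate = investigate_retry / investigate_all

print(f"total scheduled: {total}")
print(f"investigation jobs: {investigate_share:.1%}")
print(f"retries among investigation jobs: {retry_share_of_investigate:.1%}")
```

Under that assumption, investigation jobs make up under a tenth of the waiting qemu_ppc64le jobs, so they alone do not explain the backlog.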

#4

Updated by okurz over 1 year ago

Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added

#5

Updated by okurz over 1 year ago

Priority changed from Urgent to Immediate

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-12h&to=now&viewPanel=9 shows that OSD has been running 50 jobs in parallel for multiple hours while the job schedule is decreasing only very slowly, too slowly. Please look into that.

#6

Updated by tinita over 1 year ago

Current worker class statistics:

openqa=> select waiting.wc, count(*)
         from (
           select string_agg(js.value, ',' order by js.value) as wc
           from jobs j
           join job_settings js on j.id = js.job_id
           where j.state = 'scheduled'
             and js.key = 'WORKER_CLASS'
             and j.t_created < '2023-09-13 10:00:00'
           group by j.id
         ) as waiting
         group by waiting.wc
         order by count(*) desc
         limit 10;
                wc                 | count 
-----------------------------------+-------
 qemu_ppc64le                      |  1920
 qemu_x86_64,tap                   |  1119
 s390-kvm-sle12                    |   524
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 64bit-ipmi-nvdimm                 |    79
 hmc_ppc64le-1disk                 |    76
 qemu_x86_64-large-mem,tap         |    68
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46
(10 rows)

Regarding qemu_ppc64le:
We currently only have one worker: powerqaworker-qam-1
malbec hasn't accepted jobs for the last 7 hours, but it's listed as idle.

spvm_ppc64le: No workers at all

#7

Updated by okurz over 1 year ago

Copied to action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come added

#8

Updated by nicksinger over 1 year ago

Status changed from New to In Progress
Assignee set to nicksinger

#9

Updated by nicksinger over 1 year ago

Checked again: jobs are being worked on slowly, but the numbers are going down. The biggest "problem" seems to be "qemu_ppc64le", where the numbers are going up again:

                wc                 | count
-----------------------------------+-------
 qemu_ppc64le                      |  2132
 qemu_x86_64,tap                   |  1084
 s390-kvm-sle12                    |   484
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 hmc_ppc64le-1disk                 |    76
 64bit-ipmi-nvdimm                 |    75
 qemu_x86_64-large-mem,tap         |    64
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46

#10

Updated by nicksinger over 1 year ago

qemu_x86_64,tap is a situation we might be able to improve if we get some more multi-machine (MM) machines up and running.

#11

Updated by tinita over 1 year ago

https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs

#12

Updated by nicksinger over 1 year ago

            wc             | count
---------------------------+-------
 qemu_ppc64le              |  2006
 qemu_x86_64,tap           |  1191
 qemu_aarch64,tap          |   386
 s390-kvm-sle12            |   350
 spvm_ppc64le              |   162
 64bit-ipmi-nvdimm         |    71
 s390-kvm,s390-kvm-sle12   |    49
 qemu_x86_64,tap,worker37  |    46
 qemu_x86_64,tap,worker29  |    45
 qemu_x86_64-large-mem,tap |    38

As expected, all of the oldest jobs on OSD are currently ppc-based. Some of them have very specific worker classes, some of which are only available on single machines that are currently offline. I canceled everything that would not run anyway.
Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=12, all peaks per architecture get worked on at the same rate, so it is safe to assume that the available resources are used as much as possible (a drop in rate is clearly visible e.g. for ppc64).
I also realized that we get a lot of new jobs from openqa-investigate. I asked in Slack (https://suse.slack.com/archives/C02AJ1E568M/p1694624128367029) whether we can and should disable it for the time being.

#13

Updated by okurz over 1 year ago

Priority changed from Immediate to Urgent

Thank you, Nick, for your thorough checking. It seems everything is working as designed, and besides the mitigations that you have applied, the best we can do is wait and continue to monitor. Reducing the priority accordingly.

#14

Updated by openqa_review over 1 year ago

Due date set to 2023-09-28

Setting due date based on mean cycle time of SUSE QE Tools

#15

Updated by livdywan over 1 year ago

Subject changed from Long job age and jobs not executed for long to Long job age and jobs not executed for long size:M

#16

Updated by okurz over 1 year ago

Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added

#18

Updated by okurz over 1 year ago

Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added

#19

Updated by okurz over 1 year ago

I looked this morning. One case I found is:

openqa=> select count(jobs.id)
         from jobs
         join job_settings on jobs.id = job_settings.job_id
         where state = 'scheduled'
           and key = 'WORKER_CLASS'
           and value = 's390-kvm-sle12';
 count 
-------
   494

I already proposed moving to the generic "s390-kvm" class. For this we have #127523.

#20

Updated by nicksinger over 1 year ago

Status changed from In Progress to Resolved

Looked again and found no problematic change. Still the usual suspects, with s390 working hard on lowering the number of jobs and mm being the biggest queue. So I think the immediate steps to verify that our infrastructure is working as expected (just with fewer resources) can be concluded, and we have follow-up tickets for specific workers and worker classes.

#21

Updated by tinita about 1 year ago

https://github.com/os-autoinst/openQA/pull/5306 merged

#22

Updated by okurz about 1 year ago

Due date deleted (~~2023-09-28~~)


