action #135578

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Long job age and jobs not executed for long size:M

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

As in #135122, we discovered mostly through user feedback rather than through alert handling that jobs have a long job age and are not executed for a long time. As people are waiting for their jobs to be executed for various products, we should ensure short-term mitigations are applied to handle the situation while we fix the underlying problems in the background.

Acceptance criteria

Suggestions


Related issues 5 (1 open, 4 closed)

Related to openQA Infrastructure - action #134927: OSD throws 503, unresponsive for some minutes size:M (Resolved, okurz, 2023-08-31)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources (Resolved, mgrifalconi)

Copied from openQA Infrastructure - action #135380: A significant number of scheduled jobs with one or two running triggers an alert (Resolved, okurz, 2023-09-07)

Copied to openQA Project - action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come (New, 2023-09-13)

Actions #1

Updated by okurz about 1 year ago

  • Copied from action #135380: A significant number of scheduled jobs with one or two running triggers an alert added
Actions #2

Updated by tinita about 1 year ago

I just came up with a query to get the worker classes of the currently waiting jobs:

openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-12 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 20;                                                                      
                 wc                 | count 
------------------------------------+-------
 qemu_x86_64,tap                    |  3283
 qemu_ppc64le                       |  2389
 s390-kvm-sle12                     |   545
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   164
 qemu_x86_64-large-mem,tap          |    64
 64bit-ipmi-nvdimm                  |    57
 qemu_x86_64,tap,worker37           |    46
 qemu_x86_64,tap,tap                |    46
 qemu_x86_64,tap,worker29           |    45
 hmc_ppc64le-1disk                  |    40
 qemu_x86_64,tap,worker40           |    36
 qemu_x86_64,tap,worker39           |    33
 qemu_x86_64,qemu_x86_64,tap,tap    |    22
 qemu_x86_64-large-mem,tap,worker39 |    21
 qemu_x86_64,tap,worker30           |    18
 qemu_x86_64-large-mem,tap,worker38 |    12
 qemu_x86_64-large-mem,tap,worker40 |    12
 qemu_x86_64-large-mem,tap,worker30 |    12
 qemu_x86_64,tap,worker31           |    11
(20 rows)

Looking only at jobs older than 3 days, the picture is even clearer:

openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-09 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 10;                                                                      
                 wc                 | count 
------------------------------------+-------
 qemu_ppc64le                       |  2162
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   162
 qemu_x86_64,tap                    |   112
 hmc_ppc64le-1disk                  |    40
 64bit-ipmi-nvdimm                  |     9
 hmc_ppc64le-4disk                  |     5
 qemu_x86_64-large-mem,tap,worker36 |     4
 qemu_x86_64-large-mem,tap,worker35 |     4
 qemu_x86_64,tap,worker31           |     3
(10 rows)

edit: additionally sorted the values inside string_agg so that the same combination of worker classes does not show up as multiple entries
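
To illustrate why the sorting matters, here is a minimal Python sketch of the same aggregation over in-memory rows. The function name `count_worker_class_combos` and the sample rows are hypothetical and not taken from openQA; only the idea (join each job's WORKER_CLASS values in sorted order, then count identical strings) mirrors the query above.

```python
from collections import Counter, defaultdict


def count_worker_class_combos(rows):
    """Count jobs per combination of WORKER_CLASS values.

    rows: iterable of (job_id, worker_class_value) pairs, one pair per
    job_settings row with key='WORKER_CLASS' (hypothetical input shape).
    """
    per_job = defaultdict(list)
    for job_id, value in rows:
        per_job[job_id].append(value)
    # Sorting before joining plays the role of "order by js.value" inside
    # string_agg: the same set of classes always yields the same string.
    combos = Counter(",".join(sorted(values)) for values in per_job.values())
    return combos.most_common()


# Hypothetical sample: jobs 1 and 2 carry the same classes in different row order.
rows = [
    (1, "tap"), (1, "qemu_x86_64"),
    (2, "qemu_x86_64"), (2, "tap"),
    (3, "qemu_ppc64le"),
]
print(count_worker_class_combos(rows))
```

Without the sort, jobs 1 and 2 would be counted under two different strings. Note that duplicates are not removed, so a job with two `tap` settings still forms its own combination such as `qemu_x86_64,tap,tap`, as seen in the tables above.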

Actions #3

Updated by tinita about 1 year ago

Looking at qemu_ppc64le specifically, I investigated how many of those jobs are investigation jobs:

openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test like '%:investigate:retry%' ;
 wc 
----
 80
(1 row)

openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test like '%:investigate:%' ;
 wc  
-----
 179
(1 row)

openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test not like '%:investigate:%' ;
  wc  
------
 2144
(1 row)
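
The three counts above overlap: the `%:investigate:%` pattern also matches the `:investigate:retry` jobs. As a rough sketch, the same split can be expressed as disjoint buckets in Python. The function name `classify_test` and the sample test names are hypothetical; the substring checks are meant to mirror the LIKE patterns used in the queries.

```python
from collections import Counter


def classify_test(test_name):
    """Put a job into exactly one bucket based on its test name."""
    # Equivalent to: test like '%:investigate:retry%'
    if ":investigate:retry" in test_name:
        return "investigate:retry"
    # Equivalent to: test like '%:investigate:%' (minus the retry jobs above)
    if ":investigate:" in test_name:
        return "investigate (other)"
    return "regular"


# Hypothetical test names, for illustration only.
names = [
    "some_test",
    "some_test:investigate:retry",
    "some_test:investigate:last_good_build:1234",
    "other_test",
]
print(Counter(classify_test(n) for n in names))
```

With disjoint buckets the numbers add up directly; in the counts above, that would put 80 jobs in the retry bucket and 179 - 80 = 99 in the other investigation bucket.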
Actions #4

Updated by okurz about 1 year ago

  • Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added
Actions #5

Updated by okurz about 1 year ago

  • Priority changed from Urgent to Immediate

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-12h&to=now&viewPanel=9 shows that for multiple hours OSD has had 50 jobs running in parallel while the job schedule decreases only very slowly, too slowly. Please look into that.

Actions #6

Updated by tinita about 1 year ago

Current worker class statistics:

openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-13 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 10;
                wc                 | count 
-----------------------------------+-------
 qemu_ppc64le                      |  1920
 qemu_x86_64,tap                   |  1119
 s390-kvm-sle12                    |   524
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 64bit-ipmi-nvdimm                 |    79
 hmc_ppc64le-1disk                 |    76
 qemu_x86_64-large-mem,tap         |    68
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46
(10 rows)

Regarding qemu_ppc64le:
We currently only have one worker: powerqaworker-qam-1
malbec hasn't accepted jobs for the last 7 hours, but it's listed as idle.

spvm_ppc64le: No workers at all

Actions #7

Updated by okurz about 1 year ago

  • Copied to action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come added
Actions #8

Updated by nicksinger about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #9

Updated by nicksinger about 1 year ago

Checked again: jobs are being worked on slowly, but the numbers are going down. The biggest "problem" seems to be "qemu_ppc64le", where the numbers are going up again:

                wc                 | count
-----------------------------------+-------
 qemu_ppc64le                      |  2132
 qemu_x86_64,tap                   |  1084
 s390-kvm-sle12                    |   484
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 hmc_ppc64le-1disk                 |    76
 64bit-ipmi-nvdimm                 |    75
 qemu_x86_64-large-mem,tap         |    64
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46
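
To make the trend explicit, the following Python sketch computes the change per worker class between two snapshots. The counts are copied from the earlier "Current worker class statistics" table and the table above; the function name `queue_deltas` is hypothetical, and the comparison is only approximate because the query behind the second table is not shown and may have used a different cutoff.

```python
def queue_deltas(before, after):
    """Return {worker_class: after - before} for classes present in both snapshots."""
    return {wc: after[wc] - before[wc] for wc in before if wc in after}


# Counts taken from the two tables in this ticket (top three classes only).
before = {"qemu_ppc64le": 1920, "qemu_x86_64,tap": 1119, "s390-kvm-sle12": 524}
after = {"qemu_ppc64le": 2132, "qemu_x86_64,tap": 1084, "s390-kvm-sle12": 484}

# Print the classes whose queue grew the most first.
for wc, delta in sorted(queue_deltas(before, after).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{wc:20} {delta:+d}")
```

Under these assumptions only qemu_ppc64le grows, while the other two queues shrink slightly.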
Actions #10

Updated by nicksinger about 1 year ago

The qemu_x86_64,tap queue is something we might be able to improve if we get some more multi-machine (MM) machines up and running.

Actions #11

Updated by tinita about 1 year ago

https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs

Actions #12

Updated by nicksinger about 1 year ago

            wc             | count
---------------------------+-------
 qemu_ppc64le              |  2006
 qemu_x86_64,tap           |  1191
 qemu_aarch64,tap          |   386
 s390-kvm-sle12            |   350
 spvm_ppc64le              |   162
 64bit-ipmi-nvdimm         |    71
 s390-kvm,s390-kvm-sle12   |    49
 qemu_x86_64,tap,worker37  |    46
 qemu_x86_64,tap,worker29  |    45
 qemu_x86_64-large-mem,tap |    38

As expected, all of the oldest jobs on OSD are currently ppc-based. Some of them have very specific worker classes, some of which are only available on single machines that are currently offline. I canceled everything that would not run anyway.
Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=12, all peaks per architecture get worked on at the same rate, so it is safe to assume that available resources are used as much as possible (the drop in rate is clearly visible e.g. for ppc64).
I also realized that we get a lot of new jobs from openqa-investigate. I asked in Slack (https://suse.slack.com/archives/C02AJ1E568M/p1694624128367029) whether we can and should disable it for the time being.

Actions #13

Updated by okurz about 1 year ago

  • Priority changed from Immediate to Urgent

Thank you, Nick, for your thorough checking. It seems everything is working as designed, and besides the mitigations that you have applied, the best we can do is wait and continue to monitor. Reducing priority accordingly.

Actions #14

Updated by openqa_review about 1 year ago

  • Due date set to 2023-09-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan about 1 year ago

  • Subject changed from Long job age and jobs not executed for long to Long job age and jobs not executed for long size:M
Actions #16

Updated by okurz about 1 year ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #18

Updated by okurz about 1 year ago

  • Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added
Actions #19

Updated by okurz about 1 year ago

I looked this morning. One case I found is:

openqa=> select count(jobs.id) from jobs join job_settings on jobs.id = job_settings.job_id where state='scheduled' and key='WORKER_CLASS' and value='s390-kvm-sle12';
 count 
-------
   494

I already proposed moving to the generic "s390-kvm" class. For this we have #127523.

Actions #20

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Resolved

Looked again and found no problematic change. It is still the usual suspects, with s390 working hard on lowering the number of jobs and mm (multi-machine) being the biggest queue. So I think the immediate steps to verify that our infrastructure is working as expected (just with fewer resources) can be concluded, and we have follow-up tickets for specific workers and worker classes.

Actions #22

Updated by okurz about 1 year ago

  • Due date deleted (2023-09-28)