action #135578 (closed)
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert
Long job age and jobs not executed for long size:M
Added by okurz about 1 year ago. Updated about 1 year ago.
0%
Description
Motivation
Similar to #135122, we discovered mostly through user feedback rather than through alert handling that we have a long job age and jobs that have not been executed for a long time. As people are waiting for their jobs to be executed for various products, we should ensure short-term mitigations are applied to handle the situation while we fix the underlying problems in the background.
Acceptance criteria
- AC1: https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&tab=alert&from=1694146517954&to=1694507672085&viewPanel=2 is significantly below the alerting threshold
Suggestions
- Look at the end of the list of scheduled jobs on https://openqa.suse.de/tests/ and identify why jobs are not picked up in a timely manner
Updated by okurz about 1 year ago
- Copied from action #135380: A significant number of scheduled jobs with one or two running triggers an alert added
Updated by tinita about 1 year ago
I just came up with a query to get the worker classes of the currently waiting jobs:
openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-12 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 20;
                 wc                 | count
------------------------------------+-------
 qemu_x86_64,tap                    |  3283
 qemu_ppc64le                       |  2389
 s390-kvm-sle12                     |   545
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   164
 qemu_x86_64-large-mem,tap          |    64
 64bit-ipmi-nvdimm                  |    57
 qemu_x86_64,tap,worker37           |    46
 qemu_x86_64,tap,tap                |    46
 qemu_x86_64,tap,worker29           |    45
 hmc_ppc64le-1disk                  |    40
 qemu_x86_64,tap,worker40           |    36
 qemu_x86_64,tap,worker39           |    33
 qemu_x86_64,qemu_x86_64,tap,tap    |    22
 qemu_x86_64-large-mem,tap,worker39 |    21
 qemu_x86_64,tap,worker30           |    18
 qemu_x86_64-large-mem,tap,worker38 |    12
 qemu_x86_64-large-mem,tap,worker40 |    12
 qemu_x86_64-large-mem,tap,worker30 |    12
 qemu_x86_64,tap,worker31           |    11
(20 rows)
Looking only at jobs older than 3 days, the picture is even clearer:
openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-09 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 10;
                 wc                 | count
------------------------------------+-------
 qemu_ppc64le                       |  2162
 qemu_x86_64-large-mem,tap,worker9  |   224
 spvm_ppc64le                       |   162
 qemu_x86_64,tap                    |   112
 hmc_ppc64le-1disk                  |    40
 64bit-ipmi-nvdimm                  |     9
 hmc_ppc64le-4disk                  |     5
 qemu_x86_64-large-mem,tap,worker36 |     4
 qemu_x86_64-large-mem,tap,worker35 |     4
 qemu_x86_64,tap,worker31           |     3
(10 rows)
Edit: additionally sorted the string_agg values so that the same combination of worker classes does not show up as multiple entries.
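The effect of sorting can be illustrated with a small Python sketch. It also shows why entries like "qemu_x86_64,tap,tap" still appear as their own row: sorting does not remove repeated values. The function names and the input shape (job id mapped to a list of WORKER_CLASS values) are made up for illustration, and whether repeated values should really be collapsed is an assumption, not something the query above does.

```python
from collections import Counter


def normalize_worker_classes(values, dedupe=True):
    """Return a canonical comma-joined key for one job's WORKER_CLASS values."""
    # Sorting makes "tap,qemu_x86_64" and "qemu_x86_64,tap" the same key,
    # which mirrors the "order by js.value" inside string_agg.
    # Dropping duplicates is an extra step that the query above does NOT do.
    items = set(values) if dedupe else values
    return ",".join(sorted(items))


def count_combinations(jobs, dedupe=True):
    """Count jobs per normalized worker class combination.

    jobs: mapping of job id -> list of WORKER_CLASS values (hypothetical shape).
    """
    return Counter(normalize_worker_classes(v, dedupe) for v in jobs.values())


# Made-up sample data, not taken from the OSD database
sample = {
    1: ["tap", "qemu_x86_64"],
    2: ["qemu_x86_64", "tap"],
    3: ["qemu_x86_64", "tap", "tap"],
}

print(count_combinations(sample, dedupe=False))  # sorted only, like the query
print(count_combinations(sample, dedupe=True))   # sorted and de-duplicated
```

With sorting only, job 3 stays in a separate "qemu_x86_64,tap,tap" group; with de-duplication all three jobs end up in the same group.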
Updated by tinita about 1 year ago
Looking at qemu_ppc64le specifically, I checked how many of those jobs are investigation jobs:
openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test like '%:investigate:retry%' ;
wc
----
80
(1 row)
openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test like '%:investigate:%' ;
wc
-----
179
(1 row)
openqa=> select count(*) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and js.value='qemu_ppc64le' and j.t_created < '2023-09-12 10:00:00' and test not like '%:investigate:%' ;
wc
------
2144
(1 row)
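To put the three counts above into proportion, here is a minimal Python sketch of the arithmetic. It only uses the numbers returned by the queries; the variable and function names are made up for illustration.

```python
def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Counts copied from the three query results above
investigate_retry = 80   # test like '%:investigate:retry%'
investigate_all = 179    # test like '%:investigate:%'
regular = 2144           # test not like '%:investigate:%'

# Every scheduled qemu_ppc64le job either matches '%:investigate:%' or not
total = investigate_all + regular

print(f"total scheduled qemu_ppc64le jobs: {total}")
print(f"investigation jobs: {share(investigate_all, total):.1f}%")
print(f"of which retry jobs: {share(investigate_retry, total):.1f}% of total")
```

Going by these numbers, investigation jobs are only a small part of the qemu_ppc64le backlog; most of the waiting jobs are regular jobs.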
Updated by okurz about 1 year ago
- Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added
Updated by okurz about 1 year ago
- Priority changed from Urgent to Immediate
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-12h&to=now&viewPanel=9 shows that for multiple hours OSD has been running 50 jobs in parallel while the number of scheduled jobs decreases only very slowly, too slowly. Please look into that.
Updated by tinita about 1 year ago
Current worker class statistics:
openqa=> select waiting.wc, count(*) from (select string_agg(js.value, ',' order by js.value) as wc from jobs j join job_settings js on j.id=js.job_id where j.state = 'scheduled' and js.key='WORKER_CLASS' and j.t_created < '2023-09-13 10:00:00' group by j.id ) as waiting group by waiting.wc order by count(*) desc limit 10;
                wc                 | count
-----------------------------------+-------
 qemu_ppc64le                      |  1920
 qemu_x86_64,tap                   |  1119
 s390-kvm-sle12                    |   524
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 64bit-ipmi-nvdimm                 |    79
 hmc_ppc64le-1disk                 |    76
 qemu_x86_64-large-mem,tap         |    68
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46
(10 rows)
Regarding qemu_ppc64le:
We currently only have one worker: powerqaworker-qam-1
malbec hasn't accepted jobs for the last 7 hours, but it's listed as idle.
spvm_ppc64le: No workers at all
Updated by okurz about 1 year ago
- Copied to action #135644: Long job age and jobs not executed for long - malbec not working on jobs since 2023-09-13 - scheduler reserving slots for multi-machine clusters which never come added
Updated by nicksinger about 1 year ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger about 1 year ago
Checked again: jobs are being worked on slowly, but the numbers are going down. The biggest "problem" seems to be "qemu_ppc64le", where the numbers are going up again:
                wc                 | count
-----------------------------------+-------
 qemu_ppc64le                      |  2132
 qemu_x86_64,tap                   |  1084
 s390-kvm-sle12                    |   484
 qemu_x86_64-large-mem,tap,worker9 |   216
 spvm_ppc64le                      |   164
 hmc_ppc64le-1disk                 |    76
 64bit-ipmi-nvdimm                 |    75
 qemu_x86_64-large-mem,tap         |    64
 qemu_x86_64,tap,tap               |    46
 qemu_x86_64,tap,worker37          |    46
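To compare this snapshot against the previous one posted by tinita, the pasted psql output can be diffed with a few lines of Python. This is only a sketch: the helper names are made up, only the top three rows of each table are included, and it assumes both tables come from comparable queries (the query behind the second table is not shown in this ticket), so the deltas are indicative at best.

```python
def parse_psql_table(text):
    """Parse 'name | count' rows from pasted psql output into a dict."""
    counts = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the '----+----' separator and '(N rows)' lines
        name, _, value = line.partition("|")
        value = value.strip()
        if value.isdigit():  # skips the 'wc | count' header row
            counts[name.strip()] = int(value)
    return counts


def diff_snapshots(before, after):
    """Return the change in count for every worker class present in both snapshots."""
    return {wc: after[wc] - before[wc] for wc in before if wc in after}


# Top three rows of the earlier snapshot
earlier = """
wc | count
-----------------------------------+-------
qemu_ppc64le | 1920
qemu_x86_64,tap | 1119
s390-kvm-sle12 | 524
"""

# Top three rows of the snapshot above
later = """
wc | count
-----------------------------------+-------
qemu_ppc64le | 2132
qemu_x86_64,tap | 1084
s390-kvm-sle12 | 484
"""

for wc, delta in diff_snapshots(parse_psql_table(earlier), parse_psql_table(later)).items():
    print(f"{wc}: {delta:+d}")
```

For these rows the sketch reports a growing qemu_ppc64le queue, while qemu_x86_64,tap and s390-kvm-sle12 shrink slightly, which matches the observation above.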
Updated by nicksinger about 1 year ago
qemu_x86_64,tap is a situation we might be able to improve if we get some more multi-machine (MM) capable machines up and running.
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs
Updated by nicksinger about 1 year ago
            wc             | count
---------------------------+-------
 qemu_ppc64le              |  2006
 qemu_x86_64,tap           |  1191
 qemu_aarch64,tap          |   386
 s390-kvm-sle12            |   350
 spvm_ppc64le              |   162
 64bit-ipmi-nvdimm         |    71
 s390-kvm,s390-kvm-sle12   |    49
 qemu_x86_64,tap,worker37  |    46
 qemu_x86_64,tap,worker29  |    45
 qemu_x86_64-large-mem,tap |    38
As expected, all of the oldest jobs on OSD are currently ppc-based. Some of them have very specific worker classes, some of which are only available on single machines that are currently offline. I canceled everything which would not run anyway.
Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=12, all peaks per architecture get worked on at the same rate, so it is safe to assume that available resources are used as much as possible (the drop in rate is, for example, clearly visible for ppc64).
I also realized that we get a lot of new jobs from openqa-investigate. I asked in Slack (https://suse.slack.com/archives/C02AJ1E568M/p1694624128367029) whether we can and should disable it for the time being.
Updated by okurz about 1 year ago
- Priority changed from Immediate to Urgent
Thank you, Nick, for your thorough checking. It seems everything is working as designed, and besides the mitigations that you have applied, the best we can do is wait and continue to monitor. Reducing the priority accordingly.
Updated by openqa_review about 1 year ago
- Due date set to 2023-09-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 1 year ago
- Subject changed from Long job age and jobs not executed for long to Long job age and jobs not executed for long size:M
Updated by okurz about 1 year ago
- Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by okurz about 1 year ago
- Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added
Updated by okurz about 1 year ago
I looked this morning. One case I found is:
openqa=> select count(jobs.id) from jobs join job_settings on jobs.id = job_settings.job_id where state='scheduled' and key='WORKER_CLASS' and value='s390-kvm-sle12';
count
-------
494
I have already proposed moving to the generic "s390-kvm" class. For this we have #127523.
Updated by nicksinger about 1 year ago
- Status changed from In Progress to Resolved
Looked again and found no problematic change. Still the usual suspects, with s390 working hard on lowering the number of jobs and MM being the biggest queue. So I think the immediate steps to verify that our infrastructure is working as expected (just with fewer resources) can be concluded, and we have follow-up tickets for specific workers and worker classes.