action #98667
[closed] Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M
Description
Updated by mkittler about 3 years ago
- Assignee set to mkittler
Total number of scheduled jobs:
openqa=> select count(id) from jobs where state = 'scheduled';
count
-------
9363
(1 row)
About 10 % of them have been restarted:
openqa=> select count(j1.id) from jobs as j1 where j1.state = 'scheduled' and (select j2.id from jobs as j2 where j1.id = j2.clone_id limit 1) is not null;
count
-------
925
(1 row)
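The correlated subquery above counts a scheduled job as "restarted" when some other job's clone_id points at it. A minimal, self-contained sketch of the same logic against a hypothetical miniature of the jobs table (SQLite via Python, only the columns the query touches):

```python
import sqlite3

# Hypothetical miniature of the openQA "jobs" table, just to illustrate the
# clone-detection logic: job 3 is the clone of job 1, so among the scheduled
# jobs only job 3 counts as a restart.
con = sqlite3.connect(":memory:")
con.execute("create table jobs (id integer primary key, state text, clone_id integer)")
con.executemany(
    "insert into jobs (id, state, clone_id) values (?, ?, ?)",
    [
        (1, "done", 3),          # original job, restarted as job 3
        (2, "scheduled", None),  # scheduled job that is nobody's clone
        (3, "scheduled", None),  # scheduled clone of job 1 -> counts as restart
    ],
)
# Equivalent to the correlated "select j2.id ... limit 1 is not null" check:
(count,) = con.execute(
    "select count(j1.id) from jobs as j1"
    " where j1.state = 'scheduled'"
    " and exists (select 1 from jobs as j2 where j2.clone_id = j1.id)"
).fetchone()
print(count)  # 1
```

The EXISTS form is the idiomatic way to express the same predicate; the planner treats both variants equivalently here.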
It came up in the chat that many live patching jobs have been scheduled but I found only 120 of those jobs (in the queue):
openqa=> select count(id) from jobs where state = 'scheduled' and TEST like '%live-patch%';
count
-------
120
(1 row)
Updated by okurz about 3 years ago
- Copied to action #98673: [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S added
Updated by okurz about 3 years ago
- Status changed from Workable to In Progress
Investigated further with mkittler.
As visible in https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&from=1631617176679&to=1631619288893, the first big rise in the number of "Maintenance: Single Incidents" jobs happened at 2021-09-14 11:06:00Z. SLE incident tests are scheduled by https://gitlab.suse.de/qa-maintenance/bot-ng/, and on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines?page=6&scope=all we can find one "schedule incident" job, https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/583038, which took an unusually long 30 minutes; from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines I could see that such a job normally takes roughly 5 minutes to complete. So during these 30 minutes bot-ng was likely busy scheduling the far too high number of tests.
Taking a look in the OSD database I see:
openqa=> select count(id) from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00';
count
-------
4954
(1 row)
openqa=> select count(id),build from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00' group by build order by count(id) desc;
count | build
-------+-----------------------------------------------------------
150 | :20372:gnu-compilers-hpc
125 | :21149:kgraft-patch-SLE12-SP5_Update_21
125 | :21146:kgraft-patch-SLE12-SP5_Update_18
125 | :21150:kgraft-patch-SLE12-SP5_Update_22
122 | :21190:kernel-livepatch-SLE15-SP3_Update_5
122 | :21185:kernel-livepatch-SLE15-SP3_Update_0
122 | :21186:kernel-livepatch-SLE15-SP3_Update_1
122 | :21189:kernel-livepatch-SLE15-SP3_Update_4
122 | :21188:kernel-livepatch-SLE15-SP3_Update_3
122 | :21187:kernel-livepatch-SLE15-SP3_Update_2
107 | :21176:kernel-livepatch-SLE15-SP2_Update_10
107 | :21177:kernel-livepatch-SLE15-SP2_Update_11
107 | :21178:kernel-livepatch-SLE15-SP2_Update_12
107 | :21184:kernel-livepatch-SLE15-SP2_Update_18
107 | :21183:kernel-livepatch-SLE15-SP2_Update_17
107 | :21181:kernel-livepatch-SLE15-SP2_Update_15
107 | :21182:kernel-livepatch-SLE15-SP2_Update_16
107 | :21180:kernel-livepatch-SLE15-SP2_Update_14
107 | :21179:kernel-livepatch-SLE15-SP2_Update_13
107 | :21169:kernel-livepatch-SLE15-SP2_Update_3
107 | :21170:kernel-livepatch-SLE15-SP2_Update_4
107 | :21171:kernel-livepatch-SLE15-SP2_Update_5
107 | :21172:kernel-livepatch-SLE15-SP2_Update_6
107 | :21173:kernel-livepatch-SLE15-SP2_Update_7
107 | :21174:kernel-livepatch-SLE15-SP2_Update_8
107 | :21175:kernel-livepatch-SLE15-SP2_Update_9
98 | :21165:kernel-livepatch-SLE15-SP1_Update_23
98 | :21151:kernel-livepatch-SLE15_Update_20
98 | :21152:kernel-livepatch-SLE15_Update_21
98 | :21153:kernel-livepatch-SLE15_Update_22
98 | :21154:kernel-livepatch-SLE15_Update_23
98 | :21155:kernel-livepatch-SLE15_Update_24
98 | :21156:kernel-livepatch-SLE15_Update_25
98 | :21157:kernel-livepatch-SLE15-SP1_Update_15
98 | :21158:kernel-livepatch-SLE15-SP1_Update_16
98 | :21159:kernel-livepatch-SLE15-SP1_Update_17
98 | :21160:kernel-livepatch-SLE15-SP1_Update_18
98 | :21161:kernel-livepatch-SLE15-SP1_Update_19
98 | :21162:kernel-livepatch-SLE15-SP1_Update_20
98 | :21163:kernel-livepatch-SLE15-SP1_Update_21
98 | :21164:kernel-livepatch-SLE15-SP1_Update_22
98 | :21166:kernel-livepatch-SLE15-SP1_Update_24
98 | :21167:kernel-livepatch-SLE15-SP1_Update_25
98 | :21168:kernel-livepatch-SLE15-SP1_Update_26
38 | :20912:fonttosfnt
38 | :20989:krb5
26 | :20645:openhpi
22 | :19315:sapnwbootstrap-formula
20 | :20949:rpmlint-mini
20 | :20956:libyui
14 | :20990:rabbitmq-server
14 | :20272:389-ds
…
(60 rows)
and
openqa=> select count(id) from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00' and build like '%livepatch%';
count
-------
4208
(1 row)
meaning that we had 4.2k livepatch jobs (plus 375 kgraft jobs) scheduled. Now the question is: is this expected? If yes, then we could adjust the alert to a much higher threshold and tell everyone that jobs will simply take their time in such cases.
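The kgraft figure can be cross-checked against the per-build breakdown above, where exactly three kgraft-patch builds appear:

```python
# Cross-check of the "plus 375 kgraft jobs" figure: the three
# kgraft-patch-SLE12-SP5 builds (:21149, :21146, :21150) in the per-build
# breakdown contributed 125 scheduled jobs each.
kgraft_counts = [125, 125, 125]
print(sum(kgraft_counts))  # 375
```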
mmeissn confirmed in https://suse.slack.com/archives/C02CANHLANP/p1631707641265700?thread_ts=1631695768.260300&cid=C02CANHLANP that this is expected from his point of view:
i would say with the current setup: yes, expected. ... does not mean it could be improved
A later, larger spike triggered even more tests:
openqa=> select count(id) from jobs where t_created >= '2021-09-14 17:00' and t_created <= '2021-09-14 18:00' and (build like '%livepatch%' or build like '%kgraft%');
count
-------
7837
(1 row)
This confirms that it is again livepatch and kgraft. I asked again in https://suse.slack.com/archives/C02CANHLANP/p1631707782266500?thread_ts=1631695768.260300&cid=C02CANHLANP to confirm that this is expected as well.
mmeissn confirmed that as well:
So... this is something that might be avoided. What happens:
We process the livepatches, but 50% or so of them have the wrong signing key first. We change the signing key and rebuild them, but there is a window of some hours between that.
In that time the openQA engine has already started the first batch.
Marcus Meissner 26 minutes ago
The signing keys of the patches for kernels older than March 2021 are affected; as we have a 1-year sliding window, this number will decrease continuously.
(Unless we have to rotate the signing key again)
So I will accept the current behaviour as "expected behaviour" and improve the alert thresholds accordingly.
Together with mkittler we are adjusting the Grafana panel and alert thresholds. I also want to include a suggestion on how to find the ratio of clones vs. all scheduled jobs; the following works in SQL:
select round ((select count(j1.id) from jobs as j1 where state = 'scheduled' and (select j2.id from jobs as j2 where j1.id = j2.clone_id limit 1) is not null) * 100. / (select count(id) from jobs where state = 'scheduled'), 2) as "percentage of clones in all scheduled jobs";
percentage of clones in all scheduled jobs
--------------------------------------------
10.11
(1 row)
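As a sanity check of the rounding formula, the same arithmetic in Python with the counts from the first comment (925 restarted clones out of 9363 scheduled jobs; the 10.11 % shown above stems from a later run when the queue composition had changed):

```python
# Same percentage formula as the SQL round(... * 100. / ..., 2) above,
# applied to the counts from the first comment.
restarted, scheduled = 925, 9363
percentage = round(restarted * 100.0 / scheduled, 2)
print(percentage)  # 9.88
```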
Updated by okurz about 3 years ago
- Description updated (diff)
- Due date set to 2021-09-29
- Status changed from In Progress to Feedback
- Priority changed from Immediate to Urgent
- Reduced prio after we clarified that we are in an expected situation, nothing really broken
- Added rollback steps "unpause alert"
still to be done:
- There is already #98673 but besides that we should still find improvement steps: https://progress.opensuse.org/projects/qa/wiki#How-we-work-on-our-backlog
Updated by mkittler about 3 years ago
I unpaused the alert after the SR had been merged.
Not sure what to improve further besides #98673, since with the SR we have already improved the investigation steps for the alert. Judging from the responses in the chat, this number of kernel live patching jobs is expected, so I suppose we cannot do much to prevent it in the future. Considering live patching is important, it likely wouldn't even make sense to deprioritize those jobs.
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
mkittler wrote:
Considering live patching is important, it likely wouldn't even make sense to deprioritize those jobs.
Well, "important" does not equal "urgent". To clarify with maintsec whether we can tweak the priority, I created #98736, because what I see right now is that tests like https://openqa.suse.de/tests/7120218 have a prio value of 50, which AFAIK is the default. So maybe no one has thought about setting a different value.
With the two follow-up tickets I consider this enough of an improvement. Resolved.