action #98667

closed

Unhandled [Alerting] Queue: State (SUSE) alert for > 4h size:M

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2021-09-15
Due date:
2021-09-29
% Done:

0%

Estimated time:

Description

Observation

[Alerting] Queue: State (SUSE) alert

See https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&editPanel=9&tab=alert&from=now-24h&to=now

Suggestions

  • Investigate why we have so many jobs queued/running/blocked (see the query sketch after this list)
  • Can we trace this to any recent behaviour/feature change?
  • We know it's Maintenance/Single Incidents
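
As a starting point for the first suggestion, a minimal query sketch (assuming direct psql access to the openqa database, as used in the comments below) to see which states dominate the queue:

openqa=> select state, count(id) from jobs group by state order by count(id) desc;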

Rollback steps

  • unpause alert after alert condition fixed

Related issues 1 (0 open, 1 closed)

Copied to QA - action #98673: [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S (Resolved, livdywan)

Actions #1

Updated by mkittler over 2 years ago

  • Assignee set to mkittler

Total number of scheduled jobs:

openqa=> select count(id) from jobs where state = 'scheduled';
 count 
-------
  9363
(1 row)

About 10 % of them are clones, i.e. the result of restarting other jobs:

openqa=> select count(j1.id) from jobs as j1 where state = 'scheduled' and (select j2.id from jobs as j2 where j1.id = j2.clone_id limit 1) is not null;
 count 
-------
   925
(1 row)

It came up in the chat that many live patching jobs had been scheduled, but I found only 120 of those jobs (in the queue):

openqa=> select count(id) from jobs where state = 'scheduled' and TEST like '%live-patch%';
 count 
-------
   120
(1 row)
Actions #2

Updated by okurz over 2 years ago

  • Copied to action #98673: [retro] Unhandled alert about job queue for nearly a day, users brought it up in chat, should have been picked up sooner size:S added
Actions #3

Updated by okurz over 2 years ago

  • Status changed from Workable to In Progress

Investigated further with mkittler.

As visible in https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=9&from=1631617176679&to=1631619288893 the first big rise in the number of "Maintenance: Single Incidents" jobs happened at 2021-09-14 11:06:00Z. SLE incident tests are scheduled by https://gitlab.suse.de/qa-maintenance/bot-ng/ and on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines?page=6&scope=all we can find one "schedule incident" job, https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/583038, which took an unusually long 30 minutes, whereas https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines shows that it normally takes roughly 5 minutes to complete. So during these 30 minutes bot-ng was likely busy scheduling the unusually high number of tests.
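
To cross-check the spike timing independently of Grafana, a hedged sketch (not part of the original session) that buckets job creation per minute around the suspected window:

openqa=> select date_trunc('minute', t_created) as minute, count(id)
         from jobs where t_created between '2021-09-14 10:30' and '2021-09-14 12:00'
         group by minute order by minute;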

Taking a look in the OSD database I see:

openqa=> select count(id) from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00';
 count 
-------
  4954
(1 row)
openqa=> select count(id),build from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00' group by build order by count(id) desc;
 count |                           build                           
-------+-----------------------------------------------------------
   150 | :20372:gnu-compilers-hpc
   125 | :21149:kgraft-patch-SLE12-SP5_Update_21
   125 | :21146:kgraft-patch-SLE12-SP5_Update_18
   125 | :21150:kgraft-patch-SLE12-SP5_Update_22
   122 | :21190:kernel-livepatch-SLE15-SP3_Update_5
   122 | :21185:kernel-livepatch-SLE15-SP3_Update_0
   122 | :21186:kernel-livepatch-SLE15-SP3_Update_1
   122 | :21189:kernel-livepatch-SLE15-SP3_Update_4
   122 | :21188:kernel-livepatch-SLE15-SP3_Update_3
   122 | :21187:kernel-livepatch-SLE15-SP3_Update_2
   107 | :21176:kernel-livepatch-SLE15-SP2_Update_10
   107 | :21177:kernel-livepatch-SLE15-SP2_Update_11
   107 | :21178:kernel-livepatch-SLE15-SP2_Update_12
   107 | :21184:kernel-livepatch-SLE15-SP2_Update_18
   107 | :21183:kernel-livepatch-SLE15-SP2_Update_17
   107 | :21181:kernel-livepatch-SLE15-SP2_Update_15
   107 | :21182:kernel-livepatch-SLE15-SP2_Update_16
   107 | :21180:kernel-livepatch-SLE15-SP2_Update_14
   107 | :21179:kernel-livepatch-SLE15-SP2_Update_13
   107 | :21169:kernel-livepatch-SLE15-SP2_Update_3
   107 | :21170:kernel-livepatch-SLE15-SP2_Update_4
   107 | :21171:kernel-livepatch-SLE15-SP2_Update_5
   107 | :21172:kernel-livepatch-SLE15-SP2_Update_6
   107 | :21173:kernel-livepatch-SLE15-SP2_Update_7
   107 | :21174:kernel-livepatch-SLE15-SP2_Update_8
   107 | :21175:kernel-livepatch-SLE15-SP2_Update_9
    98 | :21165:kernel-livepatch-SLE15-SP1_Update_23
    98 | :21151:kernel-livepatch-SLE15_Update_20
    98 | :21152:kernel-livepatch-SLE15_Update_21
    98 | :21153:kernel-livepatch-SLE15_Update_22
    98 | :21154:kernel-livepatch-SLE15_Update_23
    98 | :21155:kernel-livepatch-SLE15_Update_24
    98 | :21156:kernel-livepatch-SLE15_Update_25
    98 | :21157:kernel-livepatch-SLE15-SP1_Update_15
    98 | :21158:kernel-livepatch-SLE15-SP1_Update_16
    98 | :21159:kernel-livepatch-SLE15-SP1_Update_17
    98 | :21160:kernel-livepatch-SLE15-SP1_Update_18
    98 | :21161:kernel-livepatch-SLE15-SP1_Update_19
    98 | :21162:kernel-livepatch-SLE15-SP1_Update_20
    98 | :21163:kernel-livepatch-SLE15-SP1_Update_21
    98 | :21164:kernel-livepatch-SLE15-SP1_Update_22
    98 | :21166:kernel-livepatch-SLE15-SP1_Update_24
    98 | :21167:kernel-livepatch-SLE15-SP1_Update_25
    98 | :21168:kernel-livepatch-SLE15-SP1_Update_26
    38 | :20912:fonttosfnt
    38 | :20989:krb5
    26 | :20645:openhpi
    22 | :19315:sapnwbootstrap-formula
    20 | :20949:rpmlint-mini
    20 | :20956:libyui
    14 | :20990:rabbitmq-server
    14 | :20272:389-ds
…
(60 rows)

and

openqa=> select count(id) from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00' and build like '%livepatch%';
 count 
-------
  4208
(1 row)

meaning that we had 4.2k livepatch jobs (plus 375 kgraft jobs) scheduled. Now the question is: is this expected? If yes, then we could adjust the alert to a much higher threshold and tell everyone that jobs will simply take their time in such cases.
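
The kgraft share can be cross-checked the same way; a sketch analogous to the livepatch query above:

openqa=> select count(id) from jobs where t_created >= '2021-09-14 11:00' and t_created <= '2021-09-14 12:00' and build like '%kgraft%';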

mmeissn confirmed in https://suse.slack.com/archives/C02CANHLANP/p1631707641265700?thread_ts=1631695768.260300&cid=C02CANHLANP that this is expected from his point of view:

i would say with the current setup: yes, expected. ... does not mean it could be improved

A later, higher spike triggered even more tests:

openqa=> select count(id) from jobs where t_created >= '2021-09-14 17:00' and t_created <= '2021-09-14 18:00' and (build like '%livepatch%' or build like '%kgraft%');
 count 
-------
  7837
(1 row)

This confirms that it is again livepatch and kgraft. I asked again in https://suse.slack.com/archives/C02CANHLANP/p1631707782266500?thread_ts=1631695768.260300&cid=C02CANHLANP to confirm that this is expected as well.

mmeissn confirmed that as well:

So... this is something that might be avoided. What happens:
We process the livepatches, but 50% or so of them have the wrong signing key at first. We change the signing key and rebuild them, but there is a window of some hours between that.
In that time the openQA engine has already started the first batch.
The signing keys of the patches for kernels older than March 2021 are affected; as we have a 1-year sliding window, this number will decrease continuously.
(Unless we have to rotate the signing key again)

So I will accept the current behaviour as "expected behaviour" and improve the alert thresholds accordingly.

Together with mkittler we are adjusting the Grafana panel and alert thresholds. I also want to include a suggestion for how to find the ratio of clones vs. all scheduled jobs; the following works in SQL:

openqa=> select round((select count(j1.id) from jobs as j1 where state = 'scheduled' and (select j2.id from jobs as j2 where j1.id = j2.clone_id limit 1) is not null) * 100. / (select count(id) from jobs where state = 'scheduled'), 2) as "percentage of clones in all scheduled jobs";
 percentage of clones in all scheduled jobs 
--------------------------------------------
                                      10.11
(1 row)
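
For comparison, the same ratio could be computed in a single pass over the scheduled jobs with an aggregate FILTER clause; a sketch assuming PostgreSQL 9.4 or later:

openqa=> select round(100.0 * count(*) filter (where exists
         (select 1 from jobs as j2 where j2.clone_id = j1.id)) / count(*), 2)
         as "percentage of clones in all scheduled jobs"
         from jobs as j1 where state = 'scheduled';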
Actions #5

Updated by okurz over 2 years ago

  • Description updated (diff)
  • Due date set to 2021-09-29
  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to Urgent
  • Reduced prio after we clarified that this is an expected situation and nothing is really broken
  • Added rollback steps "unpause alert"

still to be done:

Actions #6

Updated by mkittler over 2 years ago

I unpaused the alert after the SR was merged.

Not sure what to improve further besides #98673 since with the SR we have already improved the investigation steps for the alert. Judging from the responses in the chat, this number of kernel live patching jobs is expected, so I suppose we cannot do much to prevent it in the future. Considering that live patching is important, it likely wouldn't even make sense to deprioritize those jobs.

Actions #8

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

mkittler wrote:

Considering that live patching is important, it likely wouldn't even make sense to deprioritize those jobs.

Well, "important" does not equal "urgent". To clarify with maintsec if maybe we can tweak priority I created #98736 because what I see right now is that tests like https://openqa.suse.de/tests/7120218 have a prio value of 50 which AFAIK is default. So maybe no one thought about setting a different value.

With the two follow-up tickets I also consider this enough of an improvement. Resolved.
