action #97043 (closed)

coordination #96447: [epic] Failed systemd services and job age alerts

job queue hitting new record 14k jobs

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-08-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 showed a new record high of 14k jobs after a multi-day network outage, likely caused by a network loop in one of the QA labs (related to work by EngInfra).

Rollback steps

  • Unpause the alert after the queue is lower again
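
To judge whether the queue is low enough again, the number of currently scheduled jobs can be checked directly on the database. A minimal sketch, following the style of the queries further down in this ticket (the exact alert threshold is not part of this query):

-- count all currently scheduled jobs; unpause the alert once this is back to a normal level
openqa=> select count(id) from jobs where state='scheduled';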

Related issues (5): 2 open, 3 closed

Related to openQA Infrastructure - action #94237: No alert about too many scheduled tests size:S (Resolved, okurz, 2021-06-18 to 2021-08-30)

Related to openQA Infrastructure - action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes (Resolved, mkittler, 2021-08-10 to 2021-08-31)

Copied to openQA Infrastructure - action #97859: Improve network for OSD (New)

Copied to openQA Infrastructure - action #97862: More openQA worker hardware for OSD size:M (Resolved, okurz, 2022-05-23)

Copied to openQA Tests - action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base (New)

Actions #1

Updated by okurz over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Right now, 2021-08-17, there are 7k jobs and the queue seems to stagnate. Around 3k jobs are "investigate" jobs. As these investigate jobs always have a high prio value they end up queueing while other jobs are scheduled and acted upon.

I checked https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=31&orgId=1&from=now-2d&to=now and found that 3k jobs are outside a group (which coincides with the number of investigation jobs) and most of the others are in "Maintenance: Single Incidents". Looking over the history of the past days in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 it looks like "Maintenance: Incidents" always made up the biggest part of the schedule. So I consider it likely that these jobs are the ones producing the big load, not other products or groups starving out "Maintenance: Incidents".
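
The two counts above could be cross-checked directly on the database. A sketch only, assuming the jobs table has a group_id column (null for jobs outside any job group) as on current openQA schemas; the ':investigate:' pattern matches the delete query used later in this ticket:

-- scheduled investigation jobs
openqa=> select count(id) from jobs where state='scheduled' and test ~ ':investigate:';
-- scheduled jobs outside any job group
openqa=> select count(id) from jobs where state='scheduled' and group_id is null;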

I also wanted to check the existing scheduled jobs and did

openqa=> select count(id) from jobs where state='scheduled' and clone_id is not null;
 count 
-------
     0

i.e. none of our current scheduled jobs have a clone_id. I was not expecting that there are currently no retriggered jobs in the schedule, but I assume it is as simple as that. We still have jobs with a clone_id. From tinita: the highest job id with a clone_id is 6873464, which is from today, so that looks ok.
I guess retriggered jobs are executed quickly; not sure why, but I guess that makes people happy :)
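
For reference, tinita's check of the highest job id with a clone_id can presumably be expressed like this (a sketch, not necessarily the exact query that was run):

-- newest job that has a clone_id set, i.e. has been cloned/retriggered
openqa=> select max(id) from jobs where clone_id is not null;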

Actions #2

Updated by openqa_review over 2 years ago

  • Due date set to 2021-09-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #3

Updated by okurz over 2 years ago

  • Related to action #94237: No alert about too many scheduled tests size:S added
Actions #4

Updated by okurz over 2 years ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/553 fixed the actual alert and it triggered correctly. I paused the alert for now as we have this ticket.

Actions #5

Updated by okurz over 2 years ago

Presented the current findings in the weekly QE Sync. People supported the view that only SLE maintenance tests are impacted and that these tests are likely also the source of the problem.

To improve we identified the following areas:

  • Network: We are mainly running on a 1G network, so communication between the central OSD instance and the workers is sub-optimal, as is the connection to the outside of the OSD network
  • Storage: Some of our storage is slow and hence tests are slow to start, see #96554
  • More workers: There is still some capacity for more workers (although we should be careful not to overload the central OSD instance)
  • Test design efficiency: Many tests use the interactive GUI installer, take a long time to add all repos and re-download the same set of updates every time instead of relying on something like nightly snapshots, etc.

Ideas:

  • Cancel older scheduled investigation jobs to crosscheck whether they really don't have a significant impact. Did that now with an SQL query:
openqa=> delete from jobs where state='scheduled' and test ~ ':investigate:' and t_created <= '2021-08-16';
DELETE 2643

Checking 3 hours later I see that the job schedule still seems to be increasing again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236, and saw 26 jobs, but

openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
 count 
-------
    79

so multiple sets of the same jobs are scheduled. The same picture shows at the back of the queue: https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.

The corresponding SQL query shows 320 jobs, so again multiple sets.
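
The query referred to above is presumably the same count as for the java-1_8_0-openjdk build, just with the ffmpeg build string (a sketch):

openqa=> select count(id) from jobs where state='scheduled' and build=':19994:ffmpeg';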

  • Create specific tickets about the above improvement points; otherwise I assume no squad would pick them up anyway.
Actions #6

Updated by osukup over 2 years ago

okurz wrote:

Checking 3 hours later I see that the job schedule still seems to be increasing again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236, and saw 26 jobs, but

openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
 count 
-------
    79

so multiple sets of the same jobs are scheduled. The same picture shows at the back of the queue: https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.

These are not the same jobs but different flavors: every incident has multiple product versions and usually multiple products, and if there is a change in the incident repo a new schedule is triggered.

See for example incident 19994 (ffmpeg): https://smelt.suse.de/incident/19994/ and look at the 'Repositories' table (and this incident is on the smaller side).
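
To see this split directly in the database, the scheduled jobs of one build could be grouped by flavor and version. A sketch, assuming distri, version and flavor are regular columns on the jobs table as in current openQA schemas:

-- breakdown of one incident build by product, version and flavor
openqa=> select distri, version, flavor, count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk' group by distri, version, flavor order by count(id) desc;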

Actions #7

Updated by okurz over 2 years ago

yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.

Any other idea why we seem to have so many jobs recently?

Actions #8

Updated by osukup over 2 years ago

okurz wrote:

yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.

Any other idea why we seem to have so many jobs recently?

We have a pretty high number of incidents in the testing queue and, even more surprisingly, in the staging area. We could lower the number of tested incidents if we stopped testing incidents in the staging area. The cons are: incident tests in openQA would only start when the incident is moved to testing (so a few hours of added delay for some incidents), and we would lose the early warning when an incident is broken in openQA tests.

Actions #9

Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback

Thanks for the explanation. I prefer to keep everything as is: right now it seems that there are simply many legitimate incident tests and they only slow down the incident tests themselves, so no other area is impacted. I will set this ticket to "Feedback" and monitor the situation over the next days.

Actions #10

Updated by okurz over 2 years ago

  • Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
Actions #11

Updated by okurz over 2 years ago

  • Copied to action #97859: Improve network for OSD added
Actions #12

Updated by okurz over 2 years ago

  • Copied to action #97862: More openQA worker hardware for OSD size:M added
Actions #13

Updated by okurz over 2 years ago

  • Copied to action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base added
Actions #14

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

Alerts are enabled again. The job queue has been fine for the past weeks. mdoucha stated that retriggered jobs for #96710 were causing the main problem.
I created three follow-up tickets: #97859, #97862, #97865.

Actions #15

Updated by okurz over 2 years ago

  • Due date deleted (2021-09-01)