action #97043
Subtask of coordination #96447 (closed): [epic] Failed systemd services and job age alerts
job queue hitting new record 14k jobs
Description
Observation
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 showed a new record high of 14k jobs after a multi-day network outage, likely related to a network loop in one of the QA labs (related to work by EngInfra).
Rollback steps
- Unpause the alert after the queue is lower again
Updated by okurz over 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Right now, 2021-08-17, there are 7k jobs and the queue seems to stagnate. Around 3k jobs are "investigate" jobs. As these investigate jobs always have a high prio value they end up queueing up while other jobs are scheduled and acted upon.
I checked https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=31&orgId=1&from=now-2d&to=now and found that 3k jobs are outside any group (which coincides with the number of investigation jobs) and most others are in "Maintenance: Single Incidents". Looking over the history of the past days in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 it looks like "Maintenance: Incidents" always had the biggest part of the schedule. So I consider it likely that these jobs are just the ones producing the big load, not other products or groups starving out "Maintenance: Incidents".
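For reference, the per-group numbers from the Grafana panel could also be cross-checked directly in the database. A minimal sketch, assuming the usual openQA schema where jobs.group_id references job_groups (jobs outside any group, like the investigation jobs, show up as NULL):
-- count scheduled jobs per job group; a NULL group means "outside a group"
select coalesce(g.name, '<no group>') as group_name, count(*) as scheduled
  from jobs j left join job_groups g on j.group_id = g.id
  where j.state = 'scheduled'
  group by g.name
  order by scheduled desc;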
I also wanted to check the existing scheduled jobs and ran:
openqa=> select count(id) from jobs where state='scheduled' and clone_id is not null;
count
-------
0
i.e. none of our currently scheduled jobs have a clone_id. I was not expecting that there are no retriggered jobs among the scheduled ones, but I assume it is as simple as that. We still have jobs with a clone_id overall. From tinita: the highest job id with a clone_id is 6873464, which is from today, so that looks ok.
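Expressed as a query, that check would be something like the following (a sketch, using the same jobs.clone_id column as above):
openqa=> select max(id) from jobs where clone_id is not null;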
I guess retriggered jobs are executed quickly; not sure why, but I guess that makes people happy :)
Updated by openqa_review over 3 years ago
- Due date set to 2021-09-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 3 years ago
- Related to action #94237: No alert about too many scheduled tests size:S added
Updated by okurz over 3 years ago
- Description updated (diff)
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/553 fixed the actual alert and it triggered correctly. I paused the alert for now as we have this ticket.
Updated by okurz over 3 years ago
Presented the current findings in the weekly QE Sync. People supported the view that only SLE maintenance tests are impacted and that they are likely also the source of the problem.
To improve the situation we identified the following areas:
- Network: We are mainly running on a 1G network, so communication between the central OSD instance and the workers is sub-optimal, as is the connection to outside the OSD network
- Storage: Some of our storage is slow and hence tests are slow to start, see #96554
- More workers: There is still some capacity for more workers (although we should be careful to not overload the central OSD instance)
- Test design efficiency: Many tests use the interactive GUI installer and take a long time to add all repos, re-downloading the same set of updates every time instead of relying on something like nightly snapshots, etc.
Ideas:
- Cancel older scheduled investigation jobs to cross-check whether they really don't have a significant impact. Did that now with an SQL query:
openqa=> delete from jobs where state='scheduled' and test ~ ':investigate:' and t_created <= '2021-08-16';
DELETE 2643
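For similar cleanups in the future it might be safer to first count the matching rows with the same filter before deleting, e.g.:
openqa=> select count(id) from jobs where state='scheduled' and test ~ ':investigate:' and t_created <= '2021-08-16';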
Checking 3 hours later I see that the job schedule still seems to increase again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236 and saw 26 jobs, but
openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
count
-------
79
so multiple sets of the same jobs seem to be scheduled. That's from the back of the queue. https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.
The corresponding SQL query shows 320 jobs, so again multiple sets.
- Create specific tickets about the above improvement points. Otherwise I assume no squad would pick them up anyway.
Updated by osukup over 3 years ago
okurz wrote:
Checking 3 hours later I see that the job schedule still seems to increase again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236 and saw 26 jobs, but
openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
count
-------
79
so multiple sets of the same jobs seem to be scheduled. That's from the back of the queue. https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.
They are not the same jobs, they are different flavors: every incident has multiple product versions and usually multiple products, and if there is a change in the incident repo a new schedule is created.
See for example incident 19994 (ffmpeg): https://smelt.suse.de/incident/19994/ and look at the 'Repositories' table there (and this incident is on the smaller side).
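To make that flavor split visible directly in the database, the scheduled jobs of one build could be broken down per version, flavor and arch. A sketch, assuming version, flavor and arch are plain columns on the jobs table just like build and test are:
-- breakdown of scheduled jobs for a single incident build
select version, flavor, arch, count(*) as scheduled
  from jobs
  where state='scheduled' and build=':19994:ffmpeg'
  group by version, flavor, arch
  order by scheduled desc;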
Updated by okurz over 3 years ago
Yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.
Any other idea why we seem to have that many jobs recently?
Updated by osukup over 3 years ago
okurz wrote:
Yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.
Any other idea why we seem to have that many jobs recently?
We have a pretty high number of incidents in the testing queue and, even more surprisingly, in the staging area. We could lower the number of tested incidents if we stop testing incidents in the staging area (cons: incident tests in openQA would only start once the incident is moved to testing, so some incidents get a few hours of added delay, and we would lose the early warning when an incident is broken in openQA tests).
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
Thanks for the explanation. I prefer to keep everything as-is: right now it seems that there are simply many legitimate incident tests and they only slow down the incident tests themselves, so no other area is impacted. I will set this ticket to "Feedback" and monitor the situation over the next days.
Updated by okurz about 3 years ago
- Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
Updated by okurz about 3 years ago
- Copied to action #97859: Improve network for OSD added
Updated by okurz about 3 years ago
- Copied to action #97862: More openQA worker hardware for OSD size:M added
Updated by okurz about 3 years ago
- Copied to action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base added
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved