action #97043 (closed)

coordination #96447: [epic] Failed systemd services and job age alerts

job queue hitting new record 14k jobs

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-08-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 showed a new record high of 14k jobs after a multi-day network outage, likely caused by a network loop in one of the QA labs (related to work by EngInfra).

Rollback steps

  • Unpause the alert after the queue is lower again
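
To judge whether the queue is low enough again, the number of currently scheduled jobs can be checked directly on the database. A minimal sketch, following the style of the queries further down in this ticket (the exact alert threshold is not part of this query):

-- count all currently scheduled jobs; unpause the alert once this is back to a normal level
openqa=> select count(id) from jobs where state='scheduled';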

Related issues (5): 2 open, 3 closed

Related to openQA Infrastructure - action #94237: No alert about too many scheduled tests size:S (Resolved, okurz, 2021-06-18 to 2021-08-30)

Related to openQA Infrastructure - action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes (Resolved, mkittler, 2021-08-10 to 2021-08-31)

Copied to openQA Infrastructure - action #97859: Improve network for OSD (New)

Copied to openQA Infrastructure - action #97862: More openQA worker hardware for OSD size:M (Resolved, okurz, 2022-05-23)

Copied to openQA Tests - action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base (New)

Actions #1

Updated by okurz over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Right now, 2021-08-17, there are 7k jobs and the queue seems to stagnate. Around 3k jobs are "investigate" jobs. As these investigate jobs always have a high prio value they end up queueing while other jobs are scheduled and acted upon.

I checked https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=31&orgId=1&from=now-2d&to=now and found that 3k jobs are outside a group (which coincides with the number of investigation jobs) and most of the others are in "Maintenance: Single Incidents". Looking over the history of the past days in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1628098325601&to=1629026623275 it looks like "Maintenance: Incidents" always made up the biggest part of the schedule. So I consider it likely that these jobs are the ones producing the big load, not other products or groups starving out "Maintenance: Incidents".
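
The two counts above could be cross-checked directly on the database. A sketch only, assuming the jobs table has a group_id column (null for jobs outside any job group) as on current openQA schemas; the ':investigate:' pattern matches the delete query used later in this ticket:

-- scheduled investigation jobs
openqa=> select count(id) from jobs where state='scheduled' and test ~ ':investigate:';
-- scheduled jobs outside any job group
openqa=> select count(id) from jobs where state='scheduled' and group_id is null;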

I also wanted to check the existing scheduled jobs and did

openqa=> select count(id) from jobs where state='scheduled' and clone_id is not null;
 count 
-------
     0

i.e. none of our current scheduled jobs have a clone_id. I was not expecting that there are currently no retriggered jobs in the schedule, but I assume it is as simple as that. We still have jobs with a clone_id. From tinita: the highest job id with a clone_id is 6873464, which is from today, so that looks ok.
I guess retriggered jobs are executed quickly; not sure why, but I guess that makes people happy :)
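
For reference, tinita's check of the highest job id with a clone_id can presumably be expressed like this (a sketch, not necessarily the exact query that was run):

-- newest job that has a clone_id set, i.e. has been cloned/retriggered
openqa=> select max(id) from jobs where clone_id is not null;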

Actions #2

Updated by openqa_review over 2 years ago

  • Due date set to 2021-09-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #3

Updated by okurz over 2 years ago

  • Related to action #94237: No alert about too many scheduled tests size:S added
Actions #4

Updated by okurz over 2 years ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/553 fixed the actual alert and it triggered correctly. I paused the alert for now as we have this ticket.

Actions #5

Updated by okurz over 2 years ago

Presented the current findings in the weekly QE Sync. People supported the view that only SLE maintenance tests are impacted and that these tests are likely also the source of the problem.

To improve we identified the following areas:

  • Network: We are mainly running on a 1G network, so communication between the central OSD instance and the workers is sub-optimal, as is the connection to the outside of the OSD network
  • Storage: Some of our storage is slow and hence tests are slow to start, see #96554
  • More workers: There is still some capacity for more workers (although we should be careful not to overload the central OSD instance)
  • Test design efficiency: Many tests use the interactive GUI installer, take a long time to add all repos and re-download the same set of updates every time instead of relying on something like nightly snapshots, etc.

Ideas:

  • Cancel older scheduled investigation jobs to crosscheck whether they really don't have a significant impact. Did that now with an SQL query:
openqa=> delete from jobs where state='scheduled' and test ~ ':investigate:' and t_created <= '2021-08-16';
DELETE 2643

Checking 3 hours later I see that the job schedule still seems to be increasing again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236, and saw 26 jobs, but

openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
 count 
-------
    79

so multiple sets of the same jobs are scheduled. The same picture shows at the back of the queue: https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.

The corresponding SQL query shows 320 jobs, so again multiple sets.
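
The query referred to above is presumably the same count as for the java-1_8_0-openjdk build, just with the ffmpeg build string (a sketch):

openqa=> select count(id) from jobs where state='scheduled' and build=':19994:ffmpeg';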

  • Create specific tickets about the above improvement points; otherwise I assume no squad would pick them up anyway.
Actions #6

Updated by osukup over 2 years ago

okurz wrote:

Checking 3 hours later I see that the job schedule still seems to be increasing again. I checked a random maintenance incident test overview, e.g. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk&distri=sle&version=15-SP1&groupid=236, and saw 26 jobs, but

openqa=> select count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk';
 count 
-------
    79

so multiple sets of the same jobs are scheduled. The same picture shows at the back of the queue: https://openqa.suse.de/tests/overview?build=:19994:ffmpeg&distri=sle&version=15&groupid=176 shows 130 jobs.

These are not the same jobs but different flavors: every incident has multiple product versions and usually multiple products, and if there is a change in the incident repo a new schedule is triggered.

See for example incident 19994 (ffmpeg): https://smelt.suse.de/incident/19994/ and look at the 'Repositories' table (and this incident is on the smaller side).
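
To see this split directly in the database, the scheduled jobs of one build could be grouped by flavor and version. A sketch, assuming distri, version and flavor are regular columns on the jobs table as in current openQA schemas:

-- breakdown of one incident build by product, version and flavor
openqa=> select distri, version, flavor, count(id) from jobs where state='scheduled' and build=':20659:java-1_8_0-openjdk' group by distri, version, flavor order by count(id) desc;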

Actions #7

Updated by okurz over 2 years ago

yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.

Any other idea why we seem to have so many jobs recently?

Actions #8

Updated by osukup over 2 years ago

okurz wrote:

yes, you are right, stupid me. https://openqa.suse.de/tests/overview?build=:20659:java-1_8_0-openjdk shows the correct number.

Any other idea why we seem to have so many jobs recently?

We have a pretty high number of incidents in the testing queue and, even more surprisingly, in the staging area. We could lower the number of tested incidents if we stopped testing incidents in the staging area. The cons are: incident tests in openQA would only start when the incident is moved to testing (so a few hours of added delay for some incidents), and we would lose the early warning when an incident is broken in openQA tests.

Actions #9

Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback

Thanks for the explanation. I prefer to keep everything as is: right now it seems that there are simply many legitimate incident tests and they only slow down the incident tests themselves, so no other area is impacted. I will set this ticket to "Feedback" and monitor the situation over the next days.

Actions #10

Updated by okurz over 2 years ago

  • Related to action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
Actions #11

Updated by okurz over 2 years ago

  • Copied to action #97859: Improve network for OSD added
Actions #12

Updated by okurz over 2 years ago

  • Copied to action #97862: More openQA worker hardware for OSD size:M added
Actions #13

Updated by okurz over 2 years ago

  • Copied to action #97865: [qe-core][qe-yast][y] Optimize openQA maintenance tests by using less interactive GUI installer as base added
Actions #14

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

Alerts are enabled again. The job queue has been fine for the past weeks. mdoucha stated that retriggered jobs for #96710 were causing the main problem.
I created three follow-up tickets: #97859, #97862, #97865.

Actions #15

Updated by okurz over 2 years ago

  • Due date deleted (2021-09-01)