action #181400 (closed)
Minion jobs could end up retrying endlessly stalling the queue size:S
Description
Observation
args:
- /var/lib/openqa/share/tests/sle: ~
  /var/lib/openqa/share/tests/sle/products/sle/needles: ~
attempts: 1
children: []
created: 2025-04-25T09:35:49.813997Z
delayed: 2025-04-25T09:40:19.829891Z
expires: ~
finished: 2025-04-25T09:40:23.518023Z
id: 15306718
lax: 0
notes:
  gru_id: 40579714
  stop_reason: Could not find GruTask '40579714' after 7 retries, giving up
parents: []
priority: 10
queue: default
result: 1
retried: 2025-04-25T09:38:11.829891Z
retries: 7
started: 2025-04-25T09:40:23.458946Z
state: failed
task: git_clone
time: 2025-04-25T10:03:00.623617Z
worker: 1989
Acceptance criteria
- AC1: Minion jobs are never endlessly retried
- AC2: Minion jobs that are no longer needed and thus no longer retried do not end up as failures
Suggestions
- DONE: Do the code changes https://github.com/os-autoinst/openQA/pull/6406 and https://github.com/os-autoinst/openQA/pull/6407
- Check the current state on OSD to ensure we are stable (no inactive Minion jobs with many retries, no jobs failing because retrying was aborted); see the sketch after this list
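For the second suggestion, here is a minimal sketch of how such a check could be scripted, assuming direct access to OSD's PostgreSQL-backed Minion instance; the connection string and the threshold of 10 retries are assumptions for this example, not values from this ticket:

#!/usr/bin/env perl
# Minimal sketch for checking the acceptance criteria, not an official openQA script.
use Mojo::Base -strict;
use Minion;

# Connection string is an assumption; adjust to the actual openQA database.
my $minion = Minion->new(Pg => 'postgresql://openqa@localhost/openqa');

# AC1: report inactive jobs that have already been retried suspiciously often.
my $inactive = $minion->jobs({states => ['inactive']});
while (my $info = $inactive->next) {
    next unless $info->{retries} > 10;    # threshold assumed for this sketch
    say "job $info->{id} ($info->{task}) already has $info->{retries} retries";
}

# AC2: list failed git_clone jobs, e.g. ones where retrying was aborted.
my $failed = $minion->jobs({states => ['failed'], tasks => ['git_clone']});
while (my $info = $failed->next) {
    say "failed git_clone job: $info->{id}";
}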
Rollback steps
- Remove the silence for "web UI: Too many Minion job failures alert": https://stats.openqa-monitor.qa.suse.de/alerting/silence/c50efdbd-e417-489b-8988-ca290fde3974/edit
- Remove the failed git_clone Minion jobs that have piled up so far
Updated by okurz about 1 month ago
- Status changed from New to Feedback
Updated by mkittler about 1 month ago
The PR https://github.com/os-autoinst/openQA/pull/6406 specifically deletes the problematic Minion jobs that were retrying endlessly.
@tinita also created https://github.com/os-autoinst/openQA/pull/6407, which makes sure that such leftover Minion jobs (even if they are not explicitly deleted) will never be retried endlessly.
Note that this is not about a problem caused by reverting the database or testresults (which we did after the 2025-04 incident). The problem is likely showing up only now because many products were scheduled at the same time using a combination of assets that made the deadlock retry happen more often.
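For context, the retry cap from https://github.com/os-autoinst/openQA/pull/6407 as described above can be sketched roughly like this. This is not the actual openQA code: the helper name, the GruTask lookup and the retry delay are assumptions; only the limit of 7 retries and the stop_reason text are taken from the failed job in the observation.

# Rough sketch of the idea, assuming $job is a Minion::Job and $gru_id comes
# from the job's notes (as in the observation above).
sub _handle_missing_gru_task {
    my ($job, $gru_id) = @_;

    # Give up after a limited number of retries instead of retrying forever.
    # At this point such jobs still end up in the "failed" state (see the follow-up below).
    return $job->fail("Could not find GruTask '$gru_id' after 7 retries, giving up")
      if $job->retries >= 7;

    # Otherwise schedule another attempt later instead of spinning in a tight loop.
    return $job->retry({delay => 30});    # delay value assumed for this sketch
}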
Updated by tinita about 1 month ago
- Description updated (diff)
I silenced "web UI: Too many Minion job failures alert".
Minion jobs with missing GruTasks are now failing after https://github.com/os-autoinst/openQA/pull/6407, which I did not plan or foresee.
Working on a fix.
https://openqa.suse.de/minion/jobs?state=failed
(example failed git_clone job, identical to the YAML shown in the Observation above)
Updated by okurz about 1 month ago
- Copied to action #181415: [osd][grafana] Add alert for inactive minion jobs added
Updated by okurz about 1 month ago
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to tinita
- Priority changed from Normal to High
As agreed with tinita, bumping to High and assigning to her as she is working on a fix.
Updated by tinita about 1 month ago
https://github.com/os-autoinst/openQA/pull/6409: Fix result in case of giving up gru jobs without GruTasks
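In terms of the sketch above, a hypothetical illustration (not the literal PR 6409 diff) of the result fix: when giving up on a Gru job whose GruTask no longer exists, the job is finished with an explanatory result instead of being failed, so leftover jobs do not pile up as failures (AC2).

# Hypothetical illustration of the fixed give-up path, following the earlier sketch:
return $job->finish("Could not find GruTask '$gru_id' after 7 retries, giving up")
  if $job->retries >= 7;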
Updated by tinita about 1 month ago
- Status changed from In Progress to Feedback
Updated by tinita about 1 month ago
Deployments on OSD are currently blocked due to Jenkins/OBS, so I have to wait for that.
Updated by tinita about 1 month ago
- Description updated (diff)
- Assignee deleted (tinita)
It looks like Jenkins/OBS is working again. I can check briefly tomorrow whether the new version is rolled out; otherwise I will unassign and someone else can do the rollback steps.
Updated by tinita about 1 month ago
- Description updated (diff)
- Status changed from Feedback to Workable
Unassigning because of absence.
Wait until the OSD deployment works again and https://github.com/os-autoinst/openQA/pull/6409 is deployed. Then check the git_clone Minion jobs.
If everything looks good, conduct rollback steps and resolve.
If anything is unclear, ask me.
Updated by okurz about 1 month ago
- Subject changed from minion jobs could end up retrying endlessly stalling the queue to Minion jobs could end up retrying endlessly stalling the queue size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler 28 days ago · Edited
The change was deployed last Friday 2025-05-02T07:24Z and the most recent failing git_clone Minion jobs are from around 2025-05-02T07:13Z, which is before the deployment. So the change works.
openqa=> select id, created, finished from minion_jobs where task = 'git_clone' and state = 'failed' order by finished desc;
id | created | finished
----------+-------------------------------+-------------------------------
15402691 | 2025-05-02 09:13:50.396366+02 | 2025-05-02 09:18:11.169744+02
15402692 | 2025-05-02 09:13:50.411072+02 | 2025-05-02 09:18:11.154065+02
15402690 | 2025-05-02 09:13:50.368243+02 | 2025-05-02 09:18:11.136976+02
…
Cleaning all 3474 failed jobs on https://openqa.suse.de/minion/jobs?task=git_clone&state=failed&offset=0 would be a lot of clicking, so I deleted them via SQL:
openqa=> delete from minion_jobs where task = 'git_clone' and state = 'failed';
DELETE 3474
I'll remove the silence as soon as the alert settles.
There are also no Minion jobs with a high number of retries (which was the initial problem we wanted to solve):
openqa=> select id, task, retries from minion_jobs where state = 'inactive';
id | task | retries
----------+---------------+---------
15437378 | obs_rsync_run | 73
15438639 | obs_rsync_run | 8
15438730 | git_clone | 6
15438731 | git_clone | 6
(4 rows)