action #181400 (closed)
Minion jobs could end up retrying endlessly stalling the queue size:S
Description
Observation
args:
- /var/lib/openqa/share/tests/sle: ~
  /var/lib/openqa/share/tests/sle/products/sle/needles: ~
attempts: 1
children: []
created: 2025-04-25T09:35:49.813997Z
delayed: 2025-04-25T09:40:19.829891Z
expires: ~
finished: 2025-04-25T09:40:23.518023Z
id: 15306718
lax: 0
notes:
  gru_id: 40579714
  stop_reason: Could not find GruTask '40579714' after 7 retries, giving up
parents: []
priority: 10
queue: default
result: 1
retried: 2025-04-25T09:38:11.829891Z
retries: 7
started: 2025-04-25T09:40:23.458946Z
state: failed
task: git_clone
time: 2025-04-25T10:03:00.623617Z
worker: 1989
Acceptance criteria
- AC1: Minion jobs are never endlessly retried
- AC2: Minion jobs that are no longer needed and thus no longer retried do not end up as failures
Suggestions
- DONE: Do the code changes https://github.com/os-autoinst/openQA/pull/6406 and https://github.com/os-autoinst/openQA/pull/6407
- Check the current state on OSD to ensure we are stable (no inactive Minion jobs with many retries, no jobs failing because retrying was aborted); see the sketch after this list
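For the second suggestion, here is a minimal sketch of how such a check could be scripted, assuming direct access to OSD's PostgreSQL-backed Minion instance; the connection string and the threshold of 10 retries are assumptions for this example, not values from this ticket:

#!/usr/bin/env perl
# Minimal sketch for checking the acceptance criteria, not an official openQA script.
use Mojo::Base -strict;
use Minion;

# Connection string is an assumption; adjust to the actual openQA database.
my $minion = Minion->new(Pg => 'postgresql://openqa@localhost/openqa');

# AC1: report inactive jobs that have already been retried suspiciously often.
my $inactive = $minion->jobs({states => ['inactive']});
while (my $info = $inactive->next) {
    next unless $info->{retries} > 10;    # threshold assumed for this sketch
    say "job $info->{id} ($info->{task}) already has $info->{retries} retries";
}

# AC2: list failed git_clone jobs, e.g. ones where retrying was aborted.
my $failed = $minion->jobs({states => ['failed'], tasks => ['git_clone']});
while (my $info = $failed->next) {
    say "failed git_clone job: $info->{id}";
}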
Rollback steps
- Remove the silence for "web UI: Too many Minion job failures alert": https://stats.openqa-monitor.qa.suse.de/alerting/silence/c50efdbd-e417-489b-8988-ca290fde3974/edit
- Remove the failed git_clone Minion jobs that have piled up so far
Updated by okurz about 1 month ago
- Status changed from New to Feedback
Updated by mkittler about 1 month ago
The PR https://github.com/os-autoinst/openQA/pull/6406 specifically deletes the problematic Minion jobs that were retrying endlessly.
@tinita also created https://github.com/os-autoinst/openQA/pull/6407, which makes sure that such leftover Minion jobs (even if they are not explicitly deleted) will never be retried endlessly.
Note that this is not about a problem caused by reverting the database or testresults (which we did after the 2025-04 incident). The problem is likely showing up only now because many products were scheduled at the same time using a combination of assets that made the deadlock retry happen more often.
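For context, the retry cap from https://github.com/os-autoinst/openQA/pull/6407 as described above can be sketched roughly like this. This is not the actual openQA code: the helper name, the GruTask lookup and the retry delay are assumptions; only the limit of 7 retries and the stop_reason text are taken from the failed job in the observation.

# Rough sketch of the idea, assuming $job is a Minion::Job and $gru_id comes
# from the job's notes (as in the observation above).
sub _handle_missing_gru_task {
    my ($job, $gru_id) = @_;

    # Give up after a limited number of retries instead of retrying forever.
    # At this point such jobs still end up in the "failed" state (see the follow-up below).
    return $job->fail("Could not find GruTask '$gru_id' after 7 retries, giving up")
      if $job->retries >= 7;

    # Otherwise schedule another attempt later instead of spinning in a tight loop.
    return $job->retry({delay => 30});    # delay value assumed for this sketch
}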
Updated by tinita about 1 month ago
- Description updated (diff)
I silenced "web UI: Too many Minion job failures alert".
Minion jobs with missing GruTasks are now failing after https://github.com/os-autoinst/openQA/pull/6407, which I did not plan or foresee.
Working on a fix.
https://openqa.suse.de/minion/jobs?state=failed
(example failed git_clone job, identical to the YAML shown in the Observation above)
Updated by okurz about 1 month ago
- Copied to action #181415: [osd][grafana] Add alert for inactive minion jobs added
Updated by okurz about 1 month ago
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to tinita
- Priority changed from Normal to High
As agreed with tinita, bumping to High and assigning to her as she is working on a fix.
Updated by tinita about 1 month ago
https://github.com/os-autoinst/openQA/pull/6409: Fix result in case of giving up gru jobs without GruTasks
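In terms of the sketch above, a hypothetical illustration (not the literal PR 6409 diff) of the result fix: when giving up on a Gru job whose GruTask no longer exists, the job is finished with an explanatory result instead of being failed, so leftover jobs do not pile up as failures (AC2).

# Hypothetical illustration of the fixed give-up path, following the earlier sketch:
return $job->finish("Could not find GruTask '$gru_id' after 7 retries, giving up")
  if $job->retries >= 7;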
Updated by tinita about 1 month ago
- Status changed from In Progress to Feedback
Updated by tinita about 1 month ago
Deployments on OSD are currently blocked due to Jenkins/OBS, so I have to wait for that.
Updated by tinita about 1 month ago
- Description updated (diff)
- Assignee deleted (tinita)
It looks like Jenkins/OBS is working again. I can check briefly tomorrow whether the new version is rolled out; otherwise I will unassign and someone else can do the rollback steps.
Updated by tinita about 1 month ago
- Description updated (diff)
- Status changed from Feedback to Workable
Unassigning because of absence.
Wait until the OSD deployment works again and https://github.com/os-autoinst/openQA/pull/6409 is deployed. Then check the git_clone Minion jobs.
If everything looks good, conduct rollback steps and resolve.
If anything is unclear, ask me.
Updated by okurz about 1 month ago
- Subject changed from minion jobs could end up retrying endlessly stalling the queue to Minion jobs could end up retrying endlessly stalling the queue size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler 28 days ago · Edited
The change was deployed last Friday 2025-05-02T07:24Z and the most recent failing git_clone Minion jobs are from around 2025-05-02T07:13Z, which is before the deployment. So the change works.
openqa=> select id, created, finished from minion_jobs where task = 'git_clone' and state = 'failed' order by finished desc;
id | created | finished
----------+-------------------------------+-------------------------------
15402691 | 2025-05-02 09:13:50.396366+02 | 2025-05-02 09:18:11.169744+02
15402692 | 2025-05-02 09:13:50.411072+02 | 2025-05-02 09:18:11.154065+02
15402690 | 2025-05-02 09:13:50.368243+02 | 2025-05-02 09:18:11.136976+02
…
Cleaning all 3474 failed jobs on https://openqa.suse.de/minion/jobs?task=git_clone&state=failed&offset=0 would be a lot of clicking, so I deleted them via SQL:
openqa=> delete from minion_jobs where task = 'git_clone' and state = 'failed';
DELETE 3474
I'll remove the silence as soon as the alert settles.
There are also no Minion jobs with a high number of retries (which was the initial problem we wanted to solve):
openqa=> select id, task, retries from minion_jobs where state = 'inactive';
id | task | retries
----------+---------------+---------
15437378 | obs_rsync_run | 73
15438639 | obs_rsync_run | 8
15438730 | git_clone | 6
15438731 | git_clone | 6
(4 rows)