action #181400

Minion jobs could end up retrying endlessly stalling the queue size:S

Added by okurz about 1 month ago. Updated 28 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-04-25
Due date:
% Done: 0%
Estimated time:
Description

Observation

args:
- /var/lib/openqa/share/tests/sle: ~
  /var/lib/openqa/share/tests/sle/products/sle/needles: ~
attempts: 1
children: []
created: 2025-04-25T09:35:49.813997Z
delayed: 2025-04-25T09:40:19.829891Z
expires: ~
finished: 2025-04-25T09:40:23.518023Z
id: 15306718
lax: 0
notes:
  gru_id: 40579714
  stop_reason: Could not find GruTask '40579714' after 7 retries, giving up
parents: []
priority: 10
queue: default
result: 1
retried: 2025-04-25T09:38:11.829891Z
retries: 7
started: 2025-04-25T09:40:23.458946Z
state: failed
task: git_clone
time: 2025-04-25T10:03:00.623617Z
worker: 1989

Acceptance criteria

  • AC1: Minion jobs are never endlessly retried
  • AC2: Minion jobs that are no longer needed and thus no longer retried are not ending up as failures
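The two acceptance criteria amount to a bounded retry with a clean give-up. A minimal Python sketch of such a policy (illustrative only; openQA itself is Perl, and the `Job` class and method names here are assumptions, not Minion's real API):

```python
MAX_RETRIES = 7  # matches the "after 7 retries" limit seen in the job notes above

class Job:
    """Tiny stand-in for a queue job, for illustration only."""
    def __init__(self, retries=0):
        self.retries = retries
        self.state = "inactive"
        self.notes = {}
        self.result = None

    def retry(self, delay=0):
        # Re-enqueue with a delay instead of spinning in a hot loop.
        self.retries += 1
        self.state = "inactive"

    def finish(self, result=None):
        # AC2: giving up is not a failure, so end in "finished".
        self.state = "finished"
        self.result = result

def handle_missing_task(job, task_id):
    """Retry while under the bound, otherwise give up cleanly (AC1)."""
    if job.retries < MAX_RETRIES:
        job.retry(delay=30)
        return "retried"
    reason = f"Could not find task '{task_id}' after {MAX_RETRIES} retries, giving up"
    job.notes["stop_reason"] = reason
    job.finish(reason)
    return "gave up"
```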

Suggestions

Rollback steps


Related issues 1 (1 open, 0 closed)

Copied to openQA Infrastructure (public) - action #181415: [osd][grafana] Add alert for inactive minion jobs (New, 2025-04-25)

Actions #1

Updated by okurz about 1 month ago

  • Status changed from New to Feedback
Actions #2

Updated by mkittler about 1 month ago

The PR https://github.com/os-autoinst/openQA/pull/6406 specifically deletes the problematic Minion jobs that were endlessly retrying.

There is also https://github.com/os-autoinst/openQA/pull/6407, created by @tinita, which makes sure that such leftover Minion jobs (even if they are not explicitly deleted) are never endlessly retried.

Note that this is not about a problem caused by reverting the database or testresults (which we did after the 2025-04 incident). The problem is likely showing up only now because many products were scheduled at the same time using a combination of assets that made the deadlock retry happen more often.

Actions #3

Updated by nicksinger about 1 month ago

  • Description updated (diff)
Actions #4

Updated by tinita about 1 month ago

  • Description updated (diff)

I silenced "web UI: Too many Minion job failures alert".

Minion jobs with missing GruTasks are now failing after https://github.com/os-autoinst/openQA/pull/6407, which I did not plan or foresee.
Working on a fix.

https://openqa.suse.de/minion/jobs?state=failed

args:
- /var/lib/openqa/share/tests/sle: ~
  /var/lib/openqa/share/tests/sle/products/sle/needles: ~
attempts: 1
children: []
created: 2025-04-25T09:35:49.813997Z
delayed: 2025-04-25T09:40:19.829891Z
expires: ~
finished: 2025-04-25T09:40:23.518023Z
id: 15306718
lax: 0
notes:
  gru_id: 40579714
  stop_reason: Could not find GruTask '40579714' after 7 retries, giving up
parents: []
priority: 10
queue: default
result: 1
retried: 2025-04-25T09:38:11.829891Z
retries: 7
started: 2025-04-25T09:40:23.458946Z
state: failed
task: git_clone
time: 2025-04-25T10:03:00.623617Z
worker: 1989
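The unforeseen side effect above is that giving up ended the jobs as failures. The intended behaviour (what https://github.com/os-autoinst/openQA/pull/6409 is about) can be sketched in Python; the `Job` class and method names are illustrative assumptions, not Minion's real API:

```python
class Job:
    """Stand-in for a queue job (illustrative, not Minion's real API)."""
    def __init__(self):
        self.state = "active"
        self.notes = {}

    def fail(self, reason):
        # Before the fix: giving up marked the job "failed", tripping
        # the "Too many Minion job failures" alert.
        self.state = "failed"
        self.notes["stop_reason"] = reason

    def finish(self, reason):
        # After the fix: a job that gives up on a missing GruTask ends
        # cleanly as "finished" and does not count as a failure.
        self.state = "finished"
        self.notes["stop_reason"] = reason

def give_up(job, gru_id, fixed=True):
    """Give up on a job whose GruTask is gone; `fixed` toggles the behaviour."""
    reason = f"Could not find GruTask '{gru_id}' after 7 retries, giving up"
    (job.finish if fixed else job.fail)(reason)
    return job.state
```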
Actions #5

Updated by okurz about 1 month ago

  • Copied to action #181415: [osd][grafana] Add alert for inactive minion jobs added
Actions #6

Updated by okurz about 1 month ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to tinita
  • Priority changed from Normal to High

As agreed with tinita, bumping to High and assigning to her as she is working on a fix.

Actions #7

Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/6409 Fix result in case of giving up gru jobs without GruTasks

Actions #8

Updated by tinita about 1 month ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by tinita about 1 month ago

Deployments on osd are currently blocked due to jenkins/OBS, so I have to wait for that.

Actions #10

Updated by tinita about 1 month ago

  • Description updated (diff)
  • Assignee deleted (tinita)

It looks like jenkins/OBS is working again. I can check briefly tomorrow whether the new version is rolled out; otherwise I will unassign and someone else can do the rollback steps.

Actions #11

Updated by tinita about 1 month ago

  • Description updated (diff)
  • Status changed from Feedback to Workable

Unassigning because of absence.
Wait until osd deployment works again and https://github.com/os-autoinst/openQA/pull/6409 is deployed. Then check git_clone minion jobs.
If everything looks good, conduct rollback steps and resolve.
If anything is unclear, ask me.

Actions #12

Updated by okurz about 1 month ago

  • Status changed from Workable to New
Actions #13

Updated by okurz about 1 month ago

  • Subject changed from minion jobs could end up retrying endlessly stalling the queue to Minion jobs could end up retrying endlessly stalling the queue size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #14

Updated by mkittler 28 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #15

Updated by mkittler 28 days ago · Edited

The change was deployed last Friday at 2025-05-02T07:24Z, and the most recent failing git_clone Minion jobs are from around 2025-05-02T07:13Z, which is before the deployment. So the change works.

openqa=> select id, created, finished from minion_jobs where task = 'git_clone' and state = 'failed' order by finished desc;
    id    |            created            |           finished            
----------+-------------------------------+-------------------------------
 15402691 | 2025-05-02 09:13:50.396366+02 | 2025-05-02 09:18:11.169744+02
 15402692 | 2025-05-02 09:13:50.411072+02 | 2025-05-02 09:18:11.154065+02
 15402690 | 2025-05-02 09:13:50.368243+02 | 2025-05-02 09:18:11.136976+02
…

Cleaning all 3474 failed jobs on https://openqa.suse.de/minion/jobs?task=git_clone&state=failed&offset=0 would be a lot of clicking, so I deleted them via SQL:

openqa=> delete from minion_jobs where task = 'git_clone' and state = 'failed';
DELETE 3474

I'll remove the silence as soon as the alert settles.


There are also no Minion jobs with a high number of retries (which was the initial problem we wanted to solve):

openqa=> select id, task, retries from minion_jobs where state = 'inactive';
    id    |     task      | retries 
----------+---------------+---------
 15437378 | obs_rsync_run |      73
 15438639 | obs_rsync_run |       8
 15438730 | git_clone     |       6
 15438731 | git_clone     |       6
(4 rows)
Actions #16

Updated by mkittler 28 days ago

  • Status changed from In Progress to Resolved

Silence removed.
