action #182021 (closed)

Parent: coordination #161414: [epic] Improved salt based infrastructure management

[alert] web UI: Too many Minion job failures alert

Added by gpathak 27 days ago. Updated 26 days ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Date: Wed, 08 May 2025

https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1

Looking at https://openqa.suse.de/minion/jobs?state=failed, most failed jobs seem to be obs_rsync.

---
args:
- project: SUSE:SLFO:1.1:Staging:S
  url: https://api.suse.de/build/SUSE:SLFO:1.1:Staging:S/_result?package=000product
attempts: 1
children: []
created: 2025-05-06T11:51:52.133879Z
delayed: 2025-05-06T11:51:52.133879Z
expires: ~
finished: 2025-05-06T11:51:52.605078Z
id: 15461168
lax: 0
notes:
  gru_id: 40709522
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: read_files.sh failed for SUSE:SLFO:1.1:Staging:S in enviroment SUSE:SLFO:1.1:Staging:S
retried: ~
retries: 0
started: 2025-05-06T11:51:52.141177Z
state: failed
task: obs_rsync_run
time: 2025-05-08T10:10:45.251805Z
worker: 1993

However, more recent failures look like this:

---
args:
- project: SUSE:SLFO:1.1:Staging:L
  url: https://api.suse.de/build/SUSE:SLFO:1.1:Staging:L/_result?package=000product
attempts: 1
children: []
created: 2025-05-08T06:29:49.087313Z
delayed: 2025-05-08T06:29:49.087313Z
expires: ~
finished: 2025-05-08T06:29:54.475994Z
id: 15500317
lax: 0
notes:
  gru_id: 40744546
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: read_files.sh failed for SUSE:SLFO:1.1:Staging:L in enviroment SUSE:SLFO:1.1:Staging:L
retried: ~
retries: 0
started: 2025-05-08T06:29:49.089502Z
state: failed
task: obs_rsync_run
time: 2025-05-08T10:08:47.877942Z
worker: 1995
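
To see which task and which failure message dominate, records in this format can be tallied directly. A minimal sketch in Python, assuming the failed-job records have been exported to a file named failed_jobs.yaml (hypothetical name) in the YAML format shown above, and that PyYAML is available:

from collections import Counter

import yaml  # PyYAML

# Tally failed minion jobs per task and per result message, using
# records in the YAML format of the dumps above.
with open("failed_jobs.yaml") as fh:
    records = [r for r in yaml.safe_load_all(fh) if r]

by_task = Counter(r["task"] for r in records)
by_message = Counter(
    (r.get("result") or {}).get("message", "<no message>") for r in records
)

print("failures per task:")
for task, count in by_task.most_common():
    print(f"  {count}x {task}")

print("most common failure messages:")
for message, count in by_message.most_common(5):
    print(f"  {count}x {message}")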

Suggestions

  • Confirm whether this is a (temporary) network connectivity issue OR a case of deleted repos that are still being picked up by OBS sync (see the sketch after this list)
  • Look into ignoring the related failures OR adjusting the repo configs
    • Or ask nicely if one of the maintainers would care to fix the config
  • Also check older failed minion jobs and consider filing new tickets if there are separate issues there
  • Look at the schedules configured in openqa-trigger-from-ibs-plugin
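
For the first suggestion, a quick probe of the _result URL from the failed job's args can already separate the two cases: a URLError points to a network problem, while HTTP 404 points to a project or package that no longer exists. A minimal sketch in Python using only the standard library; note that api.suse.de normally requires OBS credentials, so an unauthenticated probe will typically get 401, which here only confirms the host is reachable:

import sys
import urllib.error
import urllib.request

# Probe the _result URL from a failed obs_rsync_run job to distinguish a
# network problem from a project/package that is gone on OBS. The default
# URL is taken from the first job dump above.
url = sys.argv[1] if len(sys.argv) > 1 else (
    "https://api.suse.de/build/SUSE:SLFO:1.1:Staging:S/_result?package=000product"
)

try:
    with urllib.request.urlopen(url, timeout=30) as response:
        print(f"reachable, HTTP {response.status}")
except urllib.error.HTTPError as err:
    if err.code in (401, 403):
        print(f"reachable, but OBS credentials are required (HTTP {err.code})")
    elif err.code == 404:
        print("project or package no longer exists on OBS (HTTP 404)")
    else:
        print(f"unexpected HTTP error: {err.code}")
except urllib.error.URLError as err:
    print(f"network problem reaching OBS: {err.reason}")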

Related issues (3): 1 open, 2 closed

Related to openQA Project (public) - action #179038: Gracious handling of longer remote git clones outages size:S (Resolved, mkittler, 2025-03-17)

Related to openQA Infrastructure (public) - action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S (Workable, livdywan, start 2025-04-14, due 2025-06-04)

Copied from openQA Infrastructure (public) - action #176013: [alert] web UI: Too many Minion job failures alert size:S (Resolved, ybonatakis, 2025-01-23)
#1

Updated by gpathak 27 days ago

  • Copied from action #176013: [alert] web UI: Too many Minion job failures alert size:S added
#2

Updated by tinita 27 days ago

  • Related to action #179038: Gracious handling of longer remote git clones outages size:S added
#3

Updated by tinita 27 days ago

  • Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added
#4

Updated by tinita 27 days ago

Note that obs_rsync failures are not counted in grafana.

There are failing git_clone tasks from 2 days ago, related to #179038. I guess I should have deleted them when reopening the ticket.
I will do this now.

For the obs_rsync failures, there is #180962.

So I guess this ticket can be closed.

#5

Updated by okurz 26 days ago

  • Priority changed from Normal to Urgent

So far I see no message about mitigations being applied, hence bumping to "Urgent".

#6

Updated by tinita 26 days ago

Like I wrote, I deleted the git_clone minion jobs, so the alert is resolved, and the alert is not about obs_rsync. I suggest rejecting this ticket.

#7

Updated by okurz 26 days ago

  • Status changed from New to Resolved
  • Assignee set to tinita

@tinita well then I see that you actually "resolved" an issue
