action #112193

[alert][osd] web UI: Too many Minion job failures alert size:S

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-06-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

Too many Minion jobs have failed on openqa.suse.de

Acceptance criteria

  • AC1: No more alerts
  • AC2: All issues are resolved or reported in currently open tickets

Suggestions

  • Review the failed jobs on https://openqa.suse.de/minion/jobs?state=failed. Create a ticket unless one already exists (see #96263 and related tickets) or the failed jobs are just a symptom of a bigger problem (e.g. a database outage).
  • After investigation, remove the failed jobs (possibly keeping one instance of each failure kind around). For the general log of the Minion job queue, check out journalctl -fu openqa-gru.service and /var/log/openqa_gru on openqa.suse.de.
  • Consider filing a ticket for the rsync issue once it has been narrowed down a bit
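
The review-and-cleanup suggestion above (group the failures, keep one instance of each kind, remove the rest) can be sketched roughly as follows. This is only an illustration and not part of openQA or Minion: it assumes the failed jobs have already been exported as a list of dictionaries with "id", "task" and "result" keys, and the helper names failure_kind and plan_cleanup are hypothetical. Treating "same task plus same first line of the result" as one failure kind is also an assumption; real failures may need a finer or coarser grouping.

```python
from collections import defaultdict


def failure_kind(job):
    """Return a coarse grouping key: task name plus first line of the result.

    The "task" and "result" keys are assumed field names of an exported
    job record, not a guaranteed format.
    """
    result = str(job.get("result") or "").strip()
    first_line = result.splitlines()[0] if result else ""
    return (job.get("task", ""), first_line)


def plan_cleanup(failed_jobs):
    """Split failed jobs into IDs to keep (one per kind) and IDs to remove.

    The job with the highest ID of each failure kind is kept as a reference;
    all other jobs of that kind are proposed for removal.
    """
    by_kind = defaultdict(list)
    for job in failed_jobs:
        by_kind[failure_kind(job)].append(job)

    keep, remove = [], []
    for jobs in by_kind.values():
        ordered = sorted(jobs, key=lambda j: j["id"])
        keep.append(ordered[-1]["id"])
        remove.extend(j["id"] for j in ordered[:-1])
    return sorted(keep), sorted(remove)
```

The function only produces a plan; the actual removal would still be done by hand on the Minion dashboard after the kept instances have been checked against the open tickets.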

Rollback steps

Unpause alert "Minion Jobs"

History

#1 Updated by okurz 2 months ago

  • Description updated (diff)

#2 Updated by cdywan 2 months ago

  • Subject changed from [alert][osd] web UI: Too many Minion job failures alert to [alert][osd] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#4 Updated by mkittler about 2 months ago

  • Status changed from Workable to Resolved

Most failures were covered by #96263 but there are also two new cases. I extended the ticket description for these new cases. I also cleaned up the Minion dashboard and resumed the alert.
