action #112193

closed

[alert][osd] web UI: Too many Minion job failures alert size:S

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-06-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

Too many Minion jobs have failed on openqa.suse.de

Acceptance criteria

  • AC1: No more alerts
  • AC2: All issues are resolved or reported in currently open tickets

Suggestions

  • Review the failed jobs on https://openqa.suse.de/minion/jobs?state=failed. Create a ticket if there isn't one already (see #96263 and related tickets) and the failed jobs aren't just a symptom of a bigger problem (e.g. a database outage).
  • After investigation, remove the failed jobs (possibly keeping one instance of each failure kind around). For the general log of the Minion job queue, check out journalctl -fu openqa-gru.service and /var/log/openqa_gru on openqa.suse.de.
  • Probably file a ticket for the rsync issue (after narrowing it down a bit)
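
The "keep one instance of each failure kind" suggestion above could be sketched roughly as follows. This is only an illustration, not an openQA or Minion tool: it assumes the failed jobs have already been exported into a list of records with the fields id, task, state and result, and it treats "same task plus same first line of the error" as one failure kind. The sample records, task names and error texts are hypothetical.

```python
from collections import defaultdict


def failure_kind(job):
    """Return a (task, first line of the error) key identifying a failure kind."""
    result = job.get("result") or ""
    lines = str(result).strip().splitlines()
    first_line = lines[0] if lines else ""
    return (job.get("task", ""), first_line)


def plan_cleanup(jobs):
    """Split failed jobs into ids to keep (one per kind) and ids to remove."""
    by_kind = defaultdict(list)
    for job in jobs:
        # Only failed jobs are candidates; other states are ignored.
        if job.get("state") == "failed":
            by_kind[failure_kind(job)].append(job["id"])

    keep, remove = [], []
    for ids in by_kind.values():
        ids.sort()
        keep.append(ids[-1])     # keep the instance with the highest id
        remove.extend(ids[:-1])  # everything else of that kind can go
    return sorted(keep), sorted(remove)


# Hypothetical sample records; not taken from openqa.suse.de.
SAMPLE_JOBS = [
    {"id": 101, "task": "task_a", "state": "failed", "result": "example error one\ntrace 1"},
    {"id": 102, "task": "task_a", "state": "failed", "result": "example error one\ntrace 2"},
    {"id": 103, "task": "task_b", "state": "failed", "result": "example error two"},
    {"id": 104, "task": "task_a", "state": "finished", "result": None},
    {"id": 105, "task": "task_a", "state": "failed", "result": "example error one"},
]

if __name__ == "__main__":
    keep_ids, remove_ids = plan_cleanup(SAMPLE_JOBS)
    print("keep:", keep_ids)
    print("remove:", remove_ids)
```

Grouping on the first line of the error is a deliberately coarse heuristic; failures whose messages embed job-specific paths or ids would need a looser key before the plan is trusted. The actual removal would still happen through the Minion dashboard after the kept instances have been linked in a ticket.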

Rollback steps

Unpause alert "Minion Jobs"

#1

Updated by okurz over 2 years ago

  • Description updated (diff)

#2

Updated by livdywan over 2 years ago

  • Subject changed from [alert][osd] web UI: Too many Minion job failures alert to [alert][osd] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable

#3

Updated by mkittler over 2 years ago

  • Assignee set to mkittler

#4

Updated by mkittler over 2 years ago

  • Status changed from Workable to Resolved

Most failures were covered by #96263, but there were also two new cases, so I extended the ticket description to cover them. I also cleaned up the Minion dashboard and resumed the alert.
