action #98499

closed

[alert] web UI: Too many Minion job failures alert size:S

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2021-09-13
Due date:
% Done: 0%
Estimated time:
Description

Observation

The alert was received on 2021-09-13, at the time OSD was deployed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=19&orgId=1&from=now-7d&to=now shows that there were already 20 failed Minion jobs. During the deployment (which could be a coincidence) another Minion job failed, bringing the total to 21 failed Minion jobs.

Acceptance criteria

  • AC1: At least a ticket exists for each different issue
  • AC2: The alert description mentions the tickets for all known issues that could explain failures

Suggestions

  • Review current failures and ensure that a ticket exists for each type (see related tickets)
  • Remove all failed minion jobs after ensuring the problem is recorded in tickets
  • Unpause alert
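
The review step above (one ticket per type of failure) could be sketched roughly as follows. This is only an illustration under stated assumptions: the job records, the `example_task` name, the error text and the classification rule in `failure_type` are hypothetical, and real data would have to be taken from the Minion dashboard. Only the ticket numbers, the `save_needle` task name and the "terminated unexpectedly" result string come from this ticket.

```python
from collections import defaultdict

# Result string quoted in this ticket's comments.
SIGTERM_RESULT = "Job terminated unexpectedly (exit code: 0, signal: 15)"


def failure_type(job):
    """Return a label for the kind of failure (assumed classification rule)."""
    result = str(job.get("result", ""))
    if result.startswith("Job terminated unexpectedly"):
        return "terminated unexpectedly"
    # Otherwise assume the task name identifies the issue.
    return job["task"]


def group_failures(jobs):
    """Group failed job IDs by failure type."""
    groups = defaultdict(list)
    for job in jobs:
        groups[failure_type(job)].append(job["id"])
    return dict(groups)


def types_without_ticket(groups, tickets):
    """Return the failure types that have no ticket recorded yet."""
    return sorted(t for t in groups if t not in tickets)


# Hypothetical sample records, not real OSD data.
failed_jobs = [
    {"id": 101, "task": "save_needle", "result": "hypothetical error text"},
    {"id": 102, "task": "example_task", "result": SIGTERM_RESULT},
    {"id": 103, "task": "save_needle", "result": "hypothetical error text"},
]

# Mapping of failure type to ticket, using the tickets related to this one.
tickets = {"save_needle": "#70774", "terminated unexpectedly": "#96263"}

groups = group_failures(failed_jobs)
missing = types_without_ticket(groups, tickets)
print(groups)
print(missing)
```

If `missing` is empty, every failure type seen is covered by a ticket and the failed jobs can be removed before unpausing the alert.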

Rollback measures

  • Unpause alert

Related issues 2 (2 open, 0 closed)

Related to openQA Project (public) - coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert (New, 2020-09-01)

Related to openQA Project (public) - action #70774: save_needle Minion tasks fail frequently and needles could get lost (New, 2020-09-01)

Actions #1

Updated by okurz over 3 years ago

  • Description updated (diff)

paused alert

Actions #2

Updated by okurz over 3 years ago

  • Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Actions #3

Updated by okurz over 3 years ago

  • Related to action #70774: save_needle Minion tasks fail frequently and needles could get lost added
Actions #4

Updated by okurz over 3 years ago

  • Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler over 3 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler over 3 years ago

  • Status changed from Workable to Feedback
  • I followed the instructions in the alert description (so the alert should be OK again soon) and extended the description to cover AC2 (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/573).
  • I extended #96263 to cover jobs failing with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This is not a new problem, but I forgot to mention those jobs when creating the issue. That covers AC1, as there were no other types of failed Minion jobs.
Actions #7

Updated by mkittler over 3 years ago

  • Status changed from Feedback to Resolved

The SR has been merged and deployed.
