action #98499

[alert] web UI: Too many Minion job failures alert size:S

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-09-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

An alert was received on 2021-09-13 while OSD was being deployed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=19&orgId=1&from=now-7d&to=now shows that there were already 20 failed Minion jobs. During the deployment, possibly by coincidence, another Minion job failed, bringing the total to 21.

Acceptance criteria

  • AC1: At least one ticket exists for each distinct issue
  • AC2: The alert description mentions the tickets for all known issues that could explain failures

Suggestions

  • Review current failures and ensure that a ticket exists for each type (see related tickets)
  • Remove all failed minion jobs after ensuring the problem is recorded in tickets
  • Unpause alert
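The review step above can be sketched as a small grouping script. This is only an illustration, not the procedure that was actually used: the record layout (`id`, `task`, `result`), the `KNOWN_ISSUES` mapping, the helper names and the sample records are all assumptions made for this example. Only the two ticket numbers, the `save_needle` task name and the quoted result string come from this ticket.

```python
from collections import defaultdict

# Assumed mapping from a failure type to its tracking ticket.
# Keys are (task, result) pairs; None acts as a wildcard.
KNOWN_ISSUES = {
    ("save_needle", None): "#70774",
    (None, "Job terminated unexpectedly (exit code: 0, signal: 15)"): "#96263",
}


def ticket_for(job):
    """Return the ticket tracking this failed job, or None if untracked."""
    for (task, result), ticket in KNOWN_ISSUES.items():
        if task is not None and job["task"] != task:
            continue
        if result is not None and job["result"] != result:
            continue
        return ticket
    return None


def group_by_ticket(failed_jobs):
    """Group failed job ids by tracking ticket (None = no ticket yet)."""
    groups = defaultdict(list)
    for job in failed_jobs:
        groups[ticket_for(job)].append(job["id"])
    return dict(groups)


# Made-up sample records, for illustration only.
sample = [
    {"id": 101, "task": "save_needle", "result": "placeholder error text"},
    {"id": 102, "task": "example_task",
     "result": "Job terminated unexpectedly (exit code: 0, signal: 15)"},
    {"id": 103, "task": "example_task", "result": "some other error"},
]

groups = group_by_ticket(sample)
for ticket, job_ids in groups.items():
    print(ticket or "NO TICKET YET", job_ids)
```

Any job ids listed under "NO TICKET YET" would need a ticket before the failed jobs are removed.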

Rollback measures

  • Unpause alert

Related issues

Related to openQA Infrastructure - coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert (New, 2020-09-01)

Related to openQA Project - action #70774: save_needle Minion tasks fail frequently (New, 2020-09-01)

History

#1 Updated by okurz about 1 month ago

  • Description updated (diff)

paused alert

#2 Updated by okurz about 1 month ago

  • Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added

#3 Updated by okurz about 1 month ago

  • Related to action #70774: save_needle Minion tasks fail frequently added

#4 Updated by okurz about 1 month ago

  • Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by mkittler about 1 month ago

  • Assignee set to mkittler

#6 Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback
  • I've followed the instructions in the alert description (so the alert should be OK again soon) and extended the description to cover AC2 (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/573).
  • I've extended #96263 to cover jobs failing with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This is not a new problem, but I forgot to mention those jobs when creating the issue. That covers AC1, as there were no other types of failed Minion jobs.

#7 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

The SR has been merged and deployed.
