action #98499 (closed)
[alert] web UI: Too many Minion job failures alert size:S
Start date: 2021-09-13
Due date:
% Done: 0%
Estimated time:
Description
Observation
An alert was received on 2021-09-13, at the time OSD was deployed. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=19&orgId=1&from=now-7d&to=now shows that there were already 20 failed Minion jobs. During the deployment, which could be a coincidence, another Minion job failed, bringing the total to 21.
Acceptance criteria
- AC1: At least one ticket exists for each distinct issue
- AC2: The alert description mentions the tickets for all known issues that could explain failures
Suggestions
- Review current failures and ensure that a ticket exists for each type (see related tickets)
- Remove all failed minion jobs after ensuring the problem is recorded in tickets
- Unpause alert
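The review step suggested above could be sketched roughly as follows. This is only an illustration: the job records, the `example_task` name, the two `example:` result strings and the result-to-ticket mapping are made up for the sketch, and real data would have to be taken from the Minion dashboard. Only the signal-15 result string and the ticket numbers come from this ticket.

```python
from collections import defaultdict

# Made-up sample of failed Minion jobs (assumption: each job exposes an id,
# a task name and a result string). Real data would come from the dashboard.
SIGNAL_15 = "Job terminated unexpectedly (exit code: 0, signal: 15)"
FAILED_JOBS = [
    {"id": 101, "task": "save_needle", "result": "example: needle save failed"},
    {"id": 102, "task": "example_task", "result": SIGNAL_15},
    {"id": 103, "task": "example_task", "result": SIGNAL_15},
    {"id": 104, "task": "example_task", "result": "example: unrecorded failure"},
]

# Assumed mapping from failure result to the ticket that records it.
KNOWN_TICKETS = {
    SIGNAL_15: "#96263",
    "example: needle save failed": "#70774",
}


def group_by_result(jobs):
    """Collect the ids of failed jobs under their result string."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["result"]].append(job["id"])
    return dict(groups)


def untracked(groups, tickets):
    """Return the failure types that have no ticket yet, sorted by result."""
    return sorted(result for result in groups if result not in tickets)


groups = group_by_result(FAILED_JOBS)
missing = untracked(groups, KNOWN_TICKETS)
for result in missing:
    print(f"no ticket yet for {len(groups[result])} job(s): {result}")
```

Failed jobs would only be removed once `missing` is empty, that is, once every failure type is recorded in a ticket.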
Rollback measures
- Unpause alert
Updated by okurz about 3 years ago
- Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Updated by okurz about 3 years ago
- Related to action #70774: save_needle Minion tasks fail frequently and needles could get lost added
Updated by okurz about 3 years ago
- Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 3 years ago
- Status changed from Workable to Feedback
- I followed the instructions in the alert description (so the alert should be OK again soon) and extended the description to cover AC2 (see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/573).
- I extended #96263 to cover jobs failing with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This is not a new problem, but I forgot to mention those jobs when creating the issue. That covers AC1, as there were no other types of failed Minion jobs.
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
The SR has been merged and deployed.