action #80538
flaky and misleading alerts about "openQA minion workers alert" as well as "Minion Jobs alert" (closed)
Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-11-27
Due date: 2021-04-14
% Done: 0%
Estimated time:
Description
Observation
On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:
- 2020-11-26 2045Z: [Alerting] openQA minion workers alert
- 2020-11-26 2046Z: [OK] openQA minion workers alert
- 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
- 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
- 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert
I assume no one took any action to make these alerts return to "OK"; they recovered on their own.
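To put a number on how short-lived these alerts were, the following is a minimal sketch that pairs each [Alerting] message with the next [OK] message for the same alert and reports the minutes in between. The event list is copied from the observation above; the function name `alert_durations` is made up for this illustration. The [OK] for openqaworker8 has no matching [Alerting] entry in the list, so it is skipped.

```python
from datetime import datetime

# Events copied from the observation list: (timestamp, state, alert name).
EVENTS = [
    ("2020-11-26 2045Z", "Alerting", "openQA minion workers alert"),
    ("2020-11-26 2046Z", "OK", "openQA minion workers alert"),
    ("2020-11-26 2106Z", "OK", "openqaworker8: Minion Jobs alert"),
    ("2020-11-27 0140Z", "Alerting", "QA-Power8-4-kvm: Minion Jobs alert"),
    ("2020-11-27 0255Z", "OK", "QA-Power8-4-kvm: Minion Jobs alert"),
]


def alert_durations(events):
    """Return {alert name: minutes between [Alerting] and the next [OK]}."""
    started = {}    # alert name -> time it started alerting
    durations = {}  # alert name -> minutes until it recovered
    for stamp, state, name in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H%MZ")
        if state == "Alerting":
            started[name] = when
        elif state == "OK" and name in started:
            delta = when - started.pop(name)
            durations[name] = int(delta.total_seconds() // 60)
    return durations


if __name__ == "__main__":
    for name, minutes in alert_durations(EVENTS).items():
        print(f"{name}: {minutes} min")
```

With the events listed here, the "openQA minion workers alert" lasted 1 minute and the QA-Power8-4-kvm alert lasted 75 minutes.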
The alert "openQA minion workers alert" has the description:
Minion workers down. Check systemd services on the openQA host
and the alert "$worker: Minion Jobs alert" has the description:
To remove all failed jobs on the machine:
```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```
Problems
- the meaning of the alerts is unclear
- alerts should be stable, i.e. not flaky
- the descriptions do not explain enough for team members to understand what the alert means and what needs to be done
Acceptance criteria
- AC1: Multiple team members can confirm that the updated name and/or description explains to them what to do
- AC2: At least one other team member is able to handle the alert based on the documentation
- AC3: The alerts are stable and only persist when there is a problem
Suggestions
- Look at the o3 minion dashboard
- Locate the alert in Grafana
- Understand what the data source for the alert is and how "minion jobs" are involved
- Research what the expected state for "minion jobs" is
- Extend the description to explain the alert; optionally reference this ticket
- Consider updating the name of the alert as well
- Stabilize the alert
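One common way to stabilize a flaky alert is to fire only after the condition has held for several consecutive evaluations, which is what the "For" setting of a Grafana alert rule does. The sketch below illustrates the idea only. The function name `debounce`, the sample values and the thresholds are assumptions made for this example, and the ticket does not state that this is the right fix for these particular alerts.

```python
def debounce(samples, threshold, required):
    """Return one alert state per sample.

    The state is True only once `required` consecutive samples
    have exceeded `threshold`; any sample at or below the
    threshold resets the streak.
    """
    streak = 0
    states = []
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        states.append(streak >= required)
    return states


if __name__ == "__main__":
    # Hypothetical failed-job counts per evaluation interval.
    failed_jobs = [0, 3, 0, 0, 4, 5, 6, 0]
    print(debounce(failed_jobs, threshold=0, required=3))
```

In this made-up series the single spike of 3 failed jobs never raises the alert, while the run of 4, 5, 6 does so on its third sample, and the alert clears as soon as the count drops back to 0.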