action #80538
Updated by livdywan about 4 years ago
## Observation

On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:

* 2020-11-26 2045Z: [Alerting] openQA minion workers alert
* 2020-11-26 2046Z: [OK] openQA minion workers alert
* 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
* 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
* 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert

I assume no one took any action to make these alerts go back to "OK".

The alert "openQA minion workers alert" has the description:

```
Minion workers down. Check systemd services on the openQA host
```

and the alert "$worker: Minion Jobs alert" has the description:

```
to remove all failed jobs on the machine
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```

## Problems

* The meaning of the alert is unclear
* Alerts should be stable, i.e. not flaky
* The description does not explain enough for team members to understand what the alert means and what needs to be done

## Acceptance criteria

* **AC1:** Define and document the intention of the alert, rename if needed
* **AC2:** Allow at least one other team member to handle the alert based on documentation
* **AC3:** The alert is stable and persists when there's a problem

## Suggestions

* Look at the [o3 minion dashboard](https://openqa.opensuse.org/minion)
* Locate the [alert definition](https://gitlab.suse.de/openqa/grafana-webhook-actions) or alert on [grafana](https://stats.openqa-monitor.qa.suse.de/)
* Understand what the data source for the alert is and how "minion jobs" are involved
* Research what the expected state for "minion jobs" is
* Extend the description to explain the alert, optionally referencing this ticket
* Consider updating the name of the alert as well
* Stabilize the alert
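
To make "flaky" in the problem statement more concrete, the following is a minimal sketch that pairs each "Alerting" message from the observation with the next "OK" message of the same alert and reports how long the alert was active. The event list is copied from the observation. The function name `alert_episodes` and the 5-minute threshold `FLAKY_BELOW` are assumptions made for illustration only; they are not taken from the actual alert definition.

```python
from datetime import datetime, timedelta

# Events copied from the observation: (timestamp, state, alert name).
EVENTS = [
    ("2020-11-26 2045Z", "Alerting", "openQA minion workers alert"),
    ("2020-11-26 2046Z", "OK", "openQA minion workers alert"),
    ("2020-11-26 2106Z", "OK", "openqaworker8: Minion Jobs alert"),
    ("2020-11-27 0140Z", "Alerting", "QA-Power8-4-kvm: Minion Jobs alert"),
    ("2020-11-27 0255Z", "OK", "QA-Power8-4-kvm: Minion Jobs alert"),
]

# Hypothetical threshold: episodes shorter than this are treated as flaky.
FLAKY_BELOW = timedelta(minutes=5)


def parse(stamp):
    # All timestamps are UTC ("Z"), so naive datetimes are enough
    # for computing differences.
    return datetime.strptime(stamp, "%Y-%m-%d %H%MZ")


def alert_episodes(events):
    """Pair each Alerting event with the next OK event of the same alert."""
    open_since = {}
    episodes = []
    for stamp, state, name in events:
        when = parse(stamp)
        if state == "Alerting":
            open_since[name] = when
        elif state == "OK" and name in open_since:
            episodes.append((name, when - open_since.pop(name)))
        # An OK without a recorded Alerting event is skipped.
    return episodes


episodes = alert_episodes(EVENTS)
for name, duration in episodes:
    label = "flaky?" if duration < FLAKY_BELOW else "sustained"
    print(f"{name}: {duration} ({label})")
```

With the events above, the "openQA minion workers alert" was active for 1 minute and the "QA-Power8-4-kvm: Minion Jobs alert" for 1 hour 15 minutes. The "[OK]" for openqaworker8 has no matching "[Alerting]" entry in the list, so it is not counted as an episode.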