action #80538

Updated by okurz 11 months ago

## Observation

On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:

* 2020-11-26 2045Z: [Alerting] openQA minion workers alert
* 2020-11-26 2046Z: [OK] openQA minion workers alert
* 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
* 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
* 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert
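
To judge how long each alert actually stayed in "Alerting", the transitions listed above can be paired up and the intervals computed. The following is only a minimal sketch: the `EVENTS` list is copied by hand from the list above, and the helper names `parse` and `alert_durations` are hypothetical, not part of openQA or grafana.

```python
from datetime import datetime, timezone

# Alert transitions copied by hand from the list above:
# (timestamp, state, alert name). Nothing here is read from grafana.
EVENTS = [
    ("2020-11-26 2045Z", "Alerting", "openQA minion workers alert"),
    ("2020-11-26 2046Z", "OK", "openQA minion workers alert"),
    ("2020-11-26 2106Z", "OK", "openqaworker8: Minion Jobs alert"),
    ("2020-11-27 0140Z", "Alerting", "QA-Power8-4-kvm: Minion Jobs alert"),
    ("2020-11-27 0255Z", "OK", "QA-Power8-4-kvm: Minion Jobs alert"),
]


def parse(ts):
    """Parse a timestamp such as '2020-11-26 2045Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%d %H%MZ").replace(tzinfo=timezone.utc)


def alert_durations(events):
    """Map each alert name to the minutes between 'Alerting' and the next 'OK'.

    The value is None when an 'OK' has no matching 'Alerting' in the list.
    """
    started = {}
    durations = {}
    for ts, state, name in events:
        when = parse(ts)
        if state == "Alerting":
            started[name] = when
        elif state == "OK":
            begin = started.pop(name, None)
            if begin is None:
                durations[name] = None
            else:
                durations[name] = int((when - begin).total_seconds() // 60)
    return durations


if __name__ == "__main__":
    for name, minutes in alert_durations(EVENTS).items():
        print(f"{name}: {minutes}")
```

With the data above, "openQA minion workers alert" was alerting for 1 minute and "QA-Power8-4-kvm: Minion Jobs alert" for 75 minutes; the "OK" for openqaworker8 has no matching "Alerting" entry in the list.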

I assume that no one intervened and that these alerts went back to "OK" on their own.

The alert "openQA minion workers alert" has the description:

Minion workers down. Check systemd services on the openQA host

and the alert "$worker: Minion Jobs alert" has the description:

to remove all failed jobs on the machine

```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```

## Problems

* meaning of alert is unclear
* alerts should be stable, i.e. not flaky
* description does not explain enough for team members to understand what the alert means and what needs to be done

## Acceptance criteria
* **AC1:** The intention of the alert is defined and documented, and the alert is renamed if needed, so that multiple team members can confirm that the updated name and/or description explains to them what to do
* **AC2:** Allow at least one other team member to handle the alert based on documentation
* **AC3:** The alert is stable and only persists when there's a problem

## Suggestions

* Look at the o3 minion dashboard
* Locate the alert definition or the alert on grafana
* Understand what the data source for the alert is and how "minion jobs" are involved
* Research what the expected state for "minion jobs" is
* Extend the description to explain what the alert means and what needs to be done, optionally reference this ticket
* Consider updating the name of the alert as well
* Stabilize alert
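
One possible way to stabilize an alert, sketched here purely as an illustration, is to let it fire only after the condition has held for a minimum time, so that a blip like the one-minute "openQA minion workers alert" above would not notify anyone. The function `stable_alert_states` and the five-minute pending period are hypothetical; whether the alert definition offers a comparable setting, and which value fits, still needs to be checked.

```python
def stable_alert_states(samples, pending_minutes):
    """Return (minute, state) transitions for a debounced alert.

    samples: list of (minute, breached) tuples sorted by minute, where
    breached is True while the alert condition is violated.
    The alert only switches to "Alerting" once the condition has been
    violated continuously for at least pending_minutes.
    """
    transitions = []
    breach_start = None  # minute at which the current violation began
    alerting = False
    for minute, breached in samples:
        if breached:
            if breach_start is None:
                breach_start = minute
            if not alerting and minute - breach_start >= pending_minutes:
                alerting = True
                transitions.append((minute, "Alerting"))
        else:
            breach_start = None
            if alerting:
                alerting = False
                transitions.append((minute, "OK"))
    return transitions


if __name__ == "__main__":
    # Hypothetical one-sample-per-minute series: a 1-minute blip at minute 0
    # and a 75-minute violation from minute 10 up to minute 84.
    samples = [(m, m == 0 or 10 <= m < 85) for m in range(90)]
    print(stable_alert_states(samples, pending_minutes=5))
```

In this made-up series the one-minute blip produces no transition at all, while the 75-minute violation still alerts, five minutes late. The trade-off is exactly that delay, so the pending period should stay well below the time in which someone is expected to react.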