action #80538

flaky and misleading alerts about "openQA minion workers alert" as well as "Minion Jobs alert"

Added by okurz about 2 months ago.

Status: Workable
Priority: High
Assignee: -
Target version:
Start date: 2020-11-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:

  • 2020-11-26 2045Z: [Alerting] openQA minion workers alert
  • 2020-11-26 2046Z: [OK] openQA minion workers alert
  • 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
  • 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
  • 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert

I assume no one took any action on these alerts, i.e. they went back to "OK" on their own.
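
To gauge how short-lived these alerts were, the timestamps above can be paired up. The following is only a minimal sketch: the event list is copied by hand from the observation, and the names `EVENTS` and `alert_durations` are made up for illustration.

```python
from datetime import datetime

# Events copied from the observation above: (timestamp, state, alert name).
EVENTS = [
    ("2020-11-26 2045Z", "Alerting", "openQA minion workers alert"),
    ("2020-11-26 2046Z", "OK", "openQA minion workers alert"),
    ("2020-11-26 2106Z", "OK", "openqaworker8: Minion Jobs alert"),
    ("2020-11-27 0140Z", "Alerting", "QA-Power8-4-kvm: Minion Jobs alert"),
    ("2020-11-27 0255Z", "OK", "QA-Power8-4-kvm: Minion Jobs alert"),
]


def alert_durations(events):
    """Return {alert name: minutes between "Alerting" and the next "OK"}."""
    started = {}
    durations = {}
    for stamp, state, name in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H%MZ")
        if state == "Alerting":
            started[name] = when
        elif state == "OK" and name in started:
            # Pair this "OK" with the pending "Alerting" of the same alert.
            delta = when - started.pop(name)
            durations[name] = int(delta.total_seconds() // 60)
    return durations


print(alert_durations(EVENTS))
```

The "openQA minion workers alert" recovered after 1 minute and the QA-Power8-4-kvm alert after 75 minutes. The openqaworker8 entry has no matching "Alerting" message in the list, so the sketch skips it.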

The alert "openQA minion workers alert" has the description:

Minion workers down. Check systemd services on the openQA host 

and the alert "$worker: Minion Jobs alert" has the description:

to remove all failed jobs on the machine
```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```

Problems

  • The meaning of the alerts is unclear
  • The alerts are flaky; they should be stable
  • The descriptions do not explain enough for team members to understand what an alert means and what needs to be done

Acceptance criteria

  • AC1: Multiple team members can confirm that the updated name and/or description explains to them what to do if the alert triggers
  • AC2: Alerts are stable

Suggestions

  • Understand what the data source for the alert is and how "minion jobs" are involved
  • Research what is the expected state for "minion jobs"
  • Extend the description to explain what the alert means and what needs to be done; optionally reference this ticket
  • Consider updating the name of the alert as well
  • Stabilize alert
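
One possible way to stabilize an alert is to fire it only after the condition has held for several evaluations in a row, so that a one-minute blip like the one on 2020-11-26 2045Z stays silent. The sketch below is a plain-Python illustration of that idea, not the actual alert configuration. It assumes the alert is driven by a numeric metric such as a failed-job count, which is exactly what the first suggestion above still needs to confirm. The function name, the parameters and the sample data are all hypothetical.

```python
def stable_alert_states(samples, threshold, required_consecutive):
    """Yield True (alerting) or False (ok) for each sample.

    The alert only fires once `required_consecutive` samples in a row
    exceed `threshold`; a single sample at or below it clears the alert.
    """
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        yield streak >= required_consecutive


# Hypothetical metric values, one per evaluation interval.
failed_jobs = [0, 3, 0, 3, 3, 3, 0]

# Alerting on every single bad sample flaps on the one-off spike.
naive = list(stable_alert_states(failed_jobs, threshold=0, required_consecutive=1))
# Requiring three bad samples in a row only reports the sustained problem.
debounced = list(stable_alert_states(failed_jobs, threshold=0, required_consecutive=3))

print(naive)
print(debounced)
```

The trade-off is a delayed notification: with three required samples, a real problem is reported two evaluation intervals later than with the naive rule.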
