action #80538

closed

flaky and misleading alerts about "openQA minion workers alert" as well as "Minion Jobs alert"

Added by okurz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-11-27
Due date: 2021-04-14
% Done: 0%
Estimated time:

Description

Observation

On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:

  • 2020-11-26 2045Z: [Alerting] openQA minion workers alert
  • 2020-11-26 2046Z: [OK] openQA minion workers alert
  • 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
  • 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
  • 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert

I assume no one took any action to make these alerts return to "OK"; they recovered on their own.
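
To make the flakiness easier to see, here is a minimal sketch that computes how long each alert stayed in "Alerting" before the automatic "OK" arrived. The event list is copied from the observation above; the helper names `parse` and `alert_durations` are made up for this illustration and are not part of openQA or grafana.

```python
from datetime import datetime, timezone

# Events copied from the observation above: (timestamp, state, alert name).
EVENTS = [
    ("2020-11-26 2045Z", "Alerting", "openQA minion workers alert"),
    ("2020-11-26 2046Z", "OK", "openQA minion workers alert"),
    ("2020-11-26 2106Z", "OK", "openqaworker8: Minion Jobs alert"),
    ("2020-11-27 0140Z", "Alerting", "QA-Power8-4-kvm: Minion Jobs alert"),
    ("2020-11-27 0255Z", "OK", "QA-Power8-4-kvm: Minion Jobs alert"),
]


def parse(stamp):
    """Parse a timestamp such as '2020-11-26 2045Z' as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, "%Y-%m-%d %H%MZ").replace(tzinfo=timezone.utc)


def alert_durations(events):
    """Return {alert name: minutes between "Alerting" and the following "OK"}."""
    started = {}
    durations = {}
    for stamp, state, name in events:
        when = parse(stamp)
        if state == "Alerting":
            started[name] = when
        elif state == "OK" and name in started:
            # An "OK" without a recorded "Alerting" (openqaworker8) is skipped.
            delta = when - started.pop(name)
            durations[name] = int(delta.total_seconds() // 60)
    return durations


if __name__ == "__main__":
    for name, minutes in alert_durations(EVENTS).items():
        print(f"{name}: {minutes} min")
```

With the events listed here, "openQA minion workers alert" was active for 1 minute and the QA-Power8-4-kvm "Minion Jobs alert" for 75 minutes. The "OK" for openqaworker8 has no matching "Alerting" entry in the list, so no duration can be derived for it.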

The alert "openQA minion workers alert" has the description:

Minion workers down. Check systemd services on the openQA host 

and the alert "$worker: Minion Jobs alert" has the description:

to remove all failed jobs on the machine
```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```

Problems

  • meaning of alert is unclear
  • alerts should be stable, i.e. not flaky
  • description does not explain enough for team members to understand what the alert means and what needs to be done

Acceptance criteria

  • AC1: Multiple team members can confirm that the updated name and/or description explains to them what to do
  • AC2: Allow at least one other team member to handle the alert based on documentation
  • AC3: The alerts are stable and only persist when there is a problem

Suggestions

  • Look at the o3 minion dashboard
  • Locate the alert on grafana
  • Understand what the data source for the alert is and how "minion jobs" are involved
  • Research what the expected state for "minion jobs" is
  • Extend the description to explain, optionally reference this ticket
  • Consider updating the name of the alert as well
  • Stabilize alert
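
As one possible reading of "Stabilize alert", the sketch below only fires once the monitored value has exceeded a threshold for several consecutive evaluations, so that a short spike does not trigger an alert followed by an automatic "OK". This assumes the flakiness comes from brief spikes, which this ticket has not confirmed; the function name, the sample values and the parameters are hypothetical and do not reflect the actual grafana configuration.

```python
def stable_alert_states(samples, threshold, required_consecutive):
    """Return one boolean per sample: True while the alert would be firing.

    The alert fires only after `required_consecutive` samples in a row
    were above `threshold`; any sample at or below it resets the streak.
    """
    firing = []
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        firing.append(streak >= required_consecutive)
    return firing


if __name__ == "__main__":
    # Hypothetical failed-job counts per evaluation interval.
    failed_jobs = [0, 3, 0, 3, 3, 3, 0]
    print(stable_alert_states(failed_jobs, threshold=0, required_consecutive=3))
```

In this made-up series the single spike at the second sample never fires; only the three consecutive non-zero samples do. Whether such a pending period is the right fix depends on what the expected state for "minion jobs" turns out to be.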

Related issues 2 (1 open, 1 closed)

Related to openQA Project - action #89560: Add alert for blocked gitlab account when users are unable to save/commit needles (Workable, 2021-03-05)

Has duplicate openQA Infrastructure - action #78061: [Alerting] openQA minion workers alert - alert turned "OK" again after 20 minutes and we don't know what was wrong (Rejected, okurz, 2020-11-16)