action #80538

closed

flaky and misleading alerts about "openQA minion workers alert" as well as "Minion Jobs alert"

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-11-27
Due date: 2021-04-14
% Done: 0%
Estimated time:

Description

Observation

On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:

  • 2020-11-26 2045Z: [Alerting] openQA minion workers alert
  • 2020-11-26 2046Z: [OK] openQA minion workers alert
  • 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
  • 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
  • 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert

I assume nobody did anything to make these alerts go back to "OK" again.

the alert "openQA minion workers alert" has the description:

Minion workers down. Check systemd services on the openQA host 

and the alert "$worker: Minion Jobs alert" has the description:

to remove all failed jobs on the machine
```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```
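Before running the removal command from the alert description it can help to first look at what is actually failed. A minimal sketch using the same `eval` interface, assuming the cache service exposes a stock Minion instance (the exact output format and the keys of the job-info hashes are assumptions and may differ per Minion version):

```
# print overall Minion statistics of the worker's cache service
# (counts of failed/finished jobs and active/inactive Minion workers)
/usr/share/openqa/script/openqa-workercache eval -V 'app->minion->stats'

# list id and task of the currently failed jobs without removing anything
# (assumes the jobs() iterator yields job-info hashes with "id" and "task" keys)
/usr/share/openqa/script/openqa-workercache eval -V \
  'my $it = app->minion->jobs({states => ["failed"]});
   my @failed;
   while (my $j = $it->next) { push @failed, { id => $j->{id}, task => $j->{task} } }
   \@failed'
```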

Problems

  • the meaning of the alert is unclear
  • alerts should be stable, i.e. not flaky
  • the description does not explain enough for team members to understand what the alert means and what needs to be done

Acceptance criteria

  • AC1: Multiple team members can confirm that the updated name and/or description explains to them what to do
  • AC2: Allow at least one other team member to handle the alert based on documentation
  • AC3: The alerts are stable and only persist when there is a problem

Suggestions

  • Look at the o3 minion dashboard
  • Locate the alert on Grafana
  • Understand what the data source for the alert is and how "minion jobs" are involved (see the sketch after this list)
  • Research what is the expected state for "minion jobs"
  • Extend the description to explain this, optionally referencing this ticket
  • Consider updating the name of the alert as well
  • Stabilize the alert
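As a concrete starting point for the suggestions above, a minimal triage sketch. The systemd unit names and the /admin/influxdb/minion route are the usual ones from openQA packaging rather than something stated in this ticket, so treat them as assumptions that may differ per host and deployment:

```
# on the openQA webUI host: the GRU service runs the webUI's Minion workers
systemctl status openqa-gru

# on a worker host: the cache service and its Minion worker
systemctl status openqa-worker-cacheservice openqa-worker-cacheservice-minion

# the webUI exposes current Minion statistics in InfluxDB line protocol;
# telegraf scrapes this and the Grafana alert evaluates the resulting series
curl -s https://openqa.suse.de/admin/influxdb/minion
```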

Related issues (1 open, 1 closed)

Related to openQA Project - action #89560: Add alert for blocked gitlab account when users are unable to save/commit needles (Workable, 2021-03-05)

Has duplicate openQA Infrastructure - action #78061: [Alerting] openQA minion workers alert - alert turned "OK" again after 20 minutes and we don't know what was wrong (Rejected, okurz, 2020-11-16)

Actions #1

Updated by livdywan over 3 years ago

  • Description updated (diff)
  • Priority changed from High to Normal

I lowered the priority since I haven't seen the alert at all since then. Determining where the actual definition lives is part of the ticket, so if there is something that was deleted or disabled I wouldn't know. But I added the obvious places to check.

Actions #2

Updated by okurz over 3 years ago

  • Description updated (diff)

We discussed that what you wrote as "acceptance criteria" were actually suggestions, which we already have. I changed them back to real criteria but kept your additions. The project https://gitlab.suse.de/openqa/grafana-webhook-actions does not contain any related alert definitions.

Actions #3

Updated by livdywan over 3 years ago

[Alerting] openQA minion workers alert

Minion workers down. Check systemd services on the openQA host

Metric name: Sum
Value: 0.984

Saw this alert, followed by an [OK]. I don't know what 0.984 tells me here. But it seems like the alert is still flaky, since I didn't see a ticket or chat conversation about someone having fixed a problem. Please correct me if I'm wrong here.

Actions #4

Updated by livdywan over 3 years ago

Metric name: Sum
Value: 0.995

As seen today, also followed by an OK shortly after, like the previous one. On a side note, the email didn't render properly in Thunderbird either.

Actions #5

Updated by livdywan over 3 years ago

Can anyone remind me why #78061 is not a duplicate of this?

Actions #6

Updated by okurz over 3 years ago

  • Has duplicate action #78061: [Alerting] openQA minion workers alert - alert turned "OK" again after 20 minutes and we don't know what was wrong added
Actions #8

Updated by okurz over 3 years ago

  • Related to action #89560: Add alert for blocked gitlab account when users are unable to save/commit needles added
Actions #9

Updated by mkittler over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #10

Updated by mkittler over 3 years ago

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/469

Not sure what else to do (see SR description).

Actions #11

Updated by openqa_review over 3 years ago

  • Due date set to 2021-04-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by mkittler over 3 years ago

  • Status changed from In Progress to Feedback

I'm waiting for feedback from the team.

Actions #13

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

The MR was merged and is in effect on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=19&tab=alert. ilausuch has also confirmed that the message is clear to him.
