action #80538
closed
flaky and misleading alerts about "openQA minion workers alert" as well as "Minion Jobs alert"
0%
Description
Observation
On 2020-11-26 and 2020-11-27 there were the following alerts and automatic ok-messages:
- 2020-11-26 2045Z: [Alerting] openQA minion workers alert
- 2020-11-26 2046Z: [OK] openQA minion workers alert
- 2020-11-26 2106Z: [OK] openqaworker8: Minion Jobs alert
- 2020-11-27 0140Z: [Alerting] QA-Power8-4-kvm: Minion Jobs alert
- 2020-11-27 0255Z: [OK] QA-Power8-4-kvm: Minion Jobs alert
I assume no one took any action to make these alerts return to "OK".
The alert "openQA minion workers alert" has the description:
Minion workers down. Check systemd services on the openQA host
and the alert "$worker: Minion Jobs alert" has the description:
to remove all failed jobs on the machine
```
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
```
Problems
- meaning of alert is unclear
- alerts should be stable, i.e. not flaky
- description does not explain enough so that team members would understand what needs to be done and what the meaning is
Acceptance criteria
- AC1: Multiple team members can confirm that the updated name and/or description explains to them what to do
- AC2: Allow at least one other team member to handle the alert based on documentation
- AC3: The alerts are stable and only persist when there is a problem
Suggestions
- Look at the o3 minion dashboard
- Locate the alert on grafana
- Understand what the data source for the alert is and how "minion jobs" are involved
- Research what is the expected state for "minion jobs"
- Extend the description to explain, optionally reference this ticket
- Consider updating the name of the alert as well
- Stabilize alert
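One common way to approach the "Stabilize alert" item is to fire only when the condition has held for several consecutive evaluations, so that a single bad sample does not produce an [Alerting]/[OK] pair within a minute. The following is a minimal sketch of that idea, not the actual Grafana alert configuration; `DebouncedAlert`, `min_workers` and `required_consecutive` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DebouncedAlert:
    """Report "Alerting" only after the check failed several times in a row."""

    min_workers: float = 1.0       # hypothetical threshold: fail below this value
    required_consecutive: int = 3  # hypothetical: failing evaluations needed in a row
    _failures: int = field(default=0, init=False)

    def evaluate(self, active_workers: float) -> str:
        if active_workers < self.min_workers:
            self._failures += 1
        else:
            self._failures = 0  # a single good sample resets the streak
        return "Alerting" if self._failures >= self.required_consecutive else "OK"


def states(samples, **kwargs):
    """Evaluate a sequence of samples and return the alert state after each one."""
    alert = DebouncedAlert(**kwargs)
    return [alert.evaluate(sample) for sample in samples]


# A one-off dip stays "OK"; only a sustained outage alerts.
print(states([1, 0, 1, 1]))
print(states([1, 0, 0, 0, 1]))
```

The trade-off is that a real outage is reported later, by roughly `required_consecutive` evaluation intervals, so the value has to be weighed against how quickly the team needs to know.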
Updated by livdywan over 3 years ago
- Description updated (diff)
- Priority changed from High to Normal
I lowered the priority because I have not seen the alert at all since then. Determining where the actual definition lives is part of the ticket, so if something was deleted or disabled I wouldn't know. But I added the obvious places to check.
Updated by okurz over 3 years ago
- Description updated (diff)
We discussed that what you wrote as "acceptance criteria" are suggestions which we already have. I changed it back to be real criteria but kept your addition. The project https://gitlab.suse.de/openqa/grafana-webhook-actions does not contain any related alert definitions.
Updated by livdywan over 3 years ago
[Alerting] openQA minion workers alert
Minion workers down. Check systemd services on the openQA host
Metric name: Sum
Value: 0.984
Saw this alert, followed by an [OK]. I don't know what 0.984 tells me here. But it seems the alert is still flaky, since I didn't see a ticket or chat conversation about someone having fixed a problem. Please correct me if I'm wrong here.
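One possible reading of a fractional value such as 0.984, stated purely as an assumption and not confirmed by the alert definition: if the panel averages a worker count over a time window, a single sample of 0 among many samples of 1 pulls the mean just below 1, which would trip a "below 1" condition. The sample counts below are made up for illustration and do not come from the monitoring data.

```python
def window_mean(samples):
    """Mean of the samples in one evaluation window."""
    return sum(samples) / len(samples)


# Hypothetical window of 63 samples of "number of active minion workers".
healthy = [1] * 63
one_gap = [1] * 62 + [0]  # one sample missed or reported as 0

print(round(window_mean(healthy), 3))  # 1.0
print(round(window_mean(one_gap), 3))  # 0.984
```

If this reading applies, the value says that the worker was seen in almost every sample, not that roughly one worker is permanently missing.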
Updated by livdywan over 3 years ago
Metric name: Sum
Value: 0.995
As seen today, also followed by an OK shortly after, like the previous one. On a side note the email didn't render properly in TB either.
Updated by livdywan over 3 years ago
Can anyone remind me why #78061 is not a duplicate of this?
Updated by okurz over 3 years ago
- Has duplicate action #78061: [Alerting] openQA minion workers alert - alert turned "OK" again after 20 minutes and we don't know what was wrong added
Updated by livdywan over 3 years ago
Discussion during the weekly:
- cache service down? possibly the gru service down?
- Hypothesis, influxdb
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=17&tab=alert
- @mkittler claims to have an idea and volunteered to pick it up
- https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/CacheService/Controller/Influxdb.pm
- https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm
Updated by okurz over 3 years ago
- Related to action #89560: Add alert for blocked gitlab account when users are unable to save/commit needles added
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler over 3 years ago
SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/469
Not sure what else to do (see SR description).
Updated by openqa_review over 3 years ago
- Due date set to 2021-04-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
- Status changed from In Progress to Feedback
I'm waiting for feedback from the team.
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
The MR was merged and is effective on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=19&tab=alert . ilausuch has also confirmed that the message is clear to him.