action #97136
closed
[alert] multiple unhandled alerts about "broken workers" size:M
0%
Description
Observation
There is at least one broken worker for more than 15 minutes. Have a look at
https://openqa.suse.de/admin/workers to find out which one it is (click the help
icon to view the concrete error message).
Metric name: Number of broken workers
Value: 15.000
View your Alert rule
http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=96&orgId=1
Suggestions
- Check the earlier alert from the start of the week, when broken workers were reported after the weekly reboot
- Understand https://github.com/os-autoinst/openQA/pull/4122 which is likely to be related
- Look into worker logs
- Prevent workers from being reported as broken too soon
Updated by okurz over 3 years ago
- Copied to action #97139: [alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M added
Updated by okurz about 3 years ago
- Subject changed from [alert] multiple unhandled alerts about "broken workers" to [alert] multiple unhandled alerts about "broken workers" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by dheidler about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by dheidler about 3 years ago
- Priority changed from Urgent to High
The broken workers alert was covering 5 minutes instead of the described 15 minutes, so let's fix that:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/564
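To illustrate why the window length matters, here is a minimal Python sketch of an alert that only fires when the condition holds for the whole window. It is not the actual alert rule from the merge request; the function name `alert_should_fire`, the sample format and the parameters `hold` and `threshold` are assumptions made for this example.

```python
from datetime import datetime, timedelta


def alert_should_fire(samples, now, hold=timedelta(minutes=15), threshold=0):
    """Return True if the broken-worker count stayed above `threshold`
    for the whole `hold` window ending at `now`.

    `samples` is a hypothetical list of (timestamp, broken_worker_count)
    tuples sorted by timestamp; the real data source is not modelled here.
    """
    window_start = now - hold
    # The series must reach back at least to the start of the window,
    # otherwise we cannot claim the condition held for the full duration.
    if not samples or samples[0][0] > window_start:
        return False
    in_window = [count for ts, count in samples if window_start <= ts <= now]
    return bool(in_window) and all(count > threshold for count in in_window)
```

With samples where workers are broken only during the last 5 minutes, this sketch fires for `hold=timedelta(minutes=5)` but not for the default 15 minutes, which matches the mismatch described above.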
Updated by dheidler about 3 years ago
- Status changed from In Progress to Feedback
Reverting my original PR, which marked workers as broken when download jobs pile up, as it seems to create more issues than it solves:
https://github.com/os-autoinst/openQA/pull/4144
Updated by dheidler about 3 years ago
Paused the broken workers alert at https://stats.openqa-monitor.qa.suse.de/alerting/list for now
Updated by dheidler about 3 years ago
MR to not count overloaded workers as broken:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/565
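As a rough illustration of the idea only (the actual change in the merge request is not shown in this ticket), counting broken workers while skipping overloaded ones could look like the Python sketch below. The worker record layout, the function name `count_broken_workers` and the marker string in `OVERLOAD_MARKERS` are all hypothetical.

```python
# Hypothetical substrings that identify an "overloaded" worker; the real
# criterion used in the merge request is not shown in this ticket.
OVERLOAD_MARKERS = ("overloaded",)


def count_broken_workers(workers, overload_markers=OVERLOAD_MARKERS):
    """Count workers with status 'broken' whose error message does not
    contain any of the overload markers.

    `workers` is a hypothetical list of dicts with the keys "status"
    and "error"; it does not mirror the real openQA data model.
    """
    def is_overloaded(worker):
        error = (worker.get("error") or "").lower()
        return any(marker in error for marker in overload_markers)

    return sum(
        1
        for worker in workers
        if worker.get("status") == "broken" and not is_overloaded(worker)
    )
```

The alert would then be evaluated on this filtered count, so a worker that is merely busy no longer triggers it.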
Updated by dheidler about 3 years ago
- Status changed from Feedback to Resolved
No new alerts so far.