action #176763
[alert] Flaky broken workers alert size:S (closed)
Description
Observation
The alert was firing from 2025-02-05T22:46Z to 2025-02-05T22:51Z, from 2025-02-06T22:56Z to 2025-02-06T23:26Z and from 2025-02-07T01:46Z to 2025-02-07T01:56Z, so it fired three times around midnight.
We also saw this alert recently before those occurrences, see #175836.
See https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-02-05T11%3A02%3A48.013Z&to=2025-02-07T09%3A23%3A58.259Z&timezone=utc&var-host_disks=%24__all for the monitoring data.
Rollback actions
- Remove the alert silence alertname=Broken workers alert (see the sketch below for one way to do that)
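For reference, a minimal sketch of how such a silence could be removed from the command line, assuming Grafana unified alerting on monitor.qa.suse.de and its Alertmanager-compatible HTTP API; the endpoint paths, the API token handling and the matcher filtering are assumptions, and in practice the silence can just as well be expired in the Grafana UI under Alerting > Silences.

```sh
# Hypothetical sketch: find and expire the silence matching the alert.
# Assumes a Grafana API token in $GRAFANA_TOKEN and jq being available.
HOST=https://monitor.qa.suse.de

# Look up the ID of the silence created for alertname="Broken workers alert"
SILENCE_ID=$(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$HOST/api/alertmanager/grafana/api/v2/silences" |
  jq -r '.[] | select(any(.matchers[]?; .name == "alertname" and .value == "Broken workers alert")) | .id')

# Expire (remove) that silence
curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$HOST/api/alertmanager/grafana/api/v2/silence/$SILENCE_ID"
```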
Updated by okurz about 2 months ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by mkittler about 2 months ago
The alert itself makes sense: if there is at least one broken worker consistently for 15 minutes, it fires, and that was indeed the case here. Setting the grace period to 30 minutes would have prevented these alerts. Should we just do that?
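For illustration, a minimal sketch of what bumping the grace period to 30 minutes could look like in a Grafana unified-alerting provisioning file; the group/rule names and the query placeholder are assumptions, and the actual definition lives in salt-states-openqa and may use a different format.

```yaml
# Hypothetical provisioning sketch, not the actual salt-states-openqa definition.
apiVersion: 1
groups:
  - orgId: 1
    name: openqa
    folder: openQA
    interval: 1m
    rules:
      - uid: broken_workers_alert
        title: Broken workers alert
        condition: C    # threshold condition on the broken-workers count
        data: []        # query and condition details omitted in this sketch
        for: 30m        # grace period: was 15m, only fire after 30m of breach
        labels:
          severity: warning
```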
Updated by ybonatakis about 2 months ago
I silenced the alert for a period of 14 days as it keeps coming back. Last event: http://monitor.qa.suse.de/goto/h3uAGBFHg?orgId=1
Updated by ybonatakis about 2 months ago
There are many messages like the following within that timeframe:
Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] [pid:17853] Worker 3054 rejected job(s) 16698642: The average load (16.76 22.09 30.50) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
I checked the websockets log with journalctl -u openqa-websockets.service --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18"
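For the record, a shell sketch of that kind of check, assuming the same host and time window; the grep patterns simply match the quoted "rejected job(s)" message and are not part of any tooling.

```sh
# Count the load-based job rejections in the affected window
journalctl -u openqa-websockets.service \
  --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18" |
  grep -c 'rejected job.*exceeding the configured threshold'

# List which workers were rejecting jobs in that window, with counts
journalctl -u openqa-websockets.service \
  --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18" |
  grep -oP 'Worker \d+(?= rejected job)' | sort | uniq -c
```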
Updated by jbaier_cz about 2 months ago
- Description updated (diff)
- Priority changed from Urgent to High
The mitigation is applied and the alert is OK at the moment, hence lowering the priority back to High.
Updated by jbaier_cz about 2 months ago
- Related to action #163394: Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively) added
Updated by jbaier_cz about 2 months ago
I tried to gather some more info, but failed to obtain anything useful. I followed the suggestion from #176763#note-2 and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1369. I feel that in this case I can't do much more and that #163394 could be helpful here.
Updated by jbaier_cz about 2 months ago
- Status changed from In Progress to Feedback
Updated by livdywan about 2 months ago
- Subject changed from [alert] Flaky broken workers alert to [alert] Flaky broken workers alert size:S
Briefly discussed during the estimation meeting.
Updated by jbaier_cz about 2 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1374 should handle the wrongly reported value.
Updated by jbaier_cz about 2 months ago
- Status changed from Feedback to Blocked
Merged but not deployed, blocked by #177324
Updated by jbaier_cz about 2 months ago
- Status changed from Blocked to In Progress
Updated by jbaier_cz about 2 months ago
I recreated the alert and prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1382. That will hopefully provision the alert correctly this time.
Updated by jbaier_cz about 1 month ago
- Due date deleted (2025-02-27)
- Status changed from In Progress to Resolved
Merged and deployed; the panel looks good and already shows some data.