action #176763
closed [alert] Flaky broken workers alert size:S
Description
Observation
The alert fired three times around midnight (UTC): from 2025-02-05T22:46Z to 2025-02-05T22:51Z, from 2025-02-06T22:56Z to 2025-02-06T23:26Z, and from 2025-02-07T01:46Z to 2025-02-07T01:56Z.
We also saw this alert recently, before those occurrences, see #175836.
See https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-02-05T11%3A02%3A48.013Z&to=2025-02-07T09%3A23%3A58.259Z&timezone=utc&var-host_disks=%24__all for the monitoring data.
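To put the three firing windows in perspective, here is a minimal sketch that computes how long each one lasted. The timestamps are copied from the observation; the names `WINDOWS`, `parse_utc` and `firing_minutes` are made up for this illustration and are not part of any openQA or Grafana tooling.

```python
from datetime import datetime

# Firing windows copied from the observation (all UTC).
WINDOWS = [
    ("2025-02-05T22:46Z", "2025-02-05T22:51Z"),
    ("2025-02-06T22:56Z", "2025-02-06T23:26Z"),
    ("2025-02-07T01:46Z", "2025-02-07T01:56Z"),
]


def parse_utc(timestamp):
    """Parse a minute-resolution timestamp such as 2025-02-05T22:46Z."""
    # %z accepts the literal "Z" suffix on Python 3.7 and newer.
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M%z")


def firing_minutes(windows):
    """Return the duration of each (start, end) window in whole minutes."""
    return [
        int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)
        for start, end in windows
    ]


if __name__ == "__main__":
    for (start, end), minutes in zip(WINDOWS, firing_minutes(WINDOWS)):
        print(f"{start} -> {end}: {minutes} min")
```

So the alert fired for 5, 30 and 10 minutes respectively, i.e. always for short periods only.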
Rollback actions
- Remove alert silence alertname=Broken workers alert
Updated by ybonatakis 3 months ago
I silenced the alert for 14 days as it keeps firing. Last event: http://monitor.qa.suse.de/goto/h3uAGBFHg?orgId=1
Updated by ybonatakis 3 months ago
There are many log messages like the following in that timeframe:
Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] [pid:17853] Worker 3054 rejected job(s) 16698642: The average load (16.76 22.09 30.50) is exceeding the configured threshold of 16. The worker will temporarily not accept new jobs until the load is lower again.
I checked the websockets service log with: journalctl -u openqa-websockets.service --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18"
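To see which workers reject jobs most often because of load, the journal output could be summarized with a small script. The following is only a sketch: the regular expression is derived from the single log line quoted above, so the exact message format for other cases (for example several job IDs in one message, assumed here to be comma-separated) is an assumption, and `parse_rejection` and `count_rejections` are hypothetical helper names.

```python
import re
import sys
from collections import Counter

# Pattern derived from the one log line quoted in this ticket; other
# variants of the message may need adjustments.
REJECTION = re.compile(
    r"Worker (?P<worker>\d+) rejected job\(s\) (?P<jobs>[\d, ]+): "
    r"The average load \((?P<load>[\d. ]+)\) is exceeding the configured "
    r"threshold of (?P<threshold>\d+(?:\.\d+)?)"
)

SAMPLE = (
    "Feb 07 22:13:39 openqa openqa-websockets-daemon[17853]: [debug] "
    "[pid:17853] Worker 3054 rejected job(s) 16698642: The average load "
    "(16.76 22.09 30.50) is exceeding the configured threshold of 16. "
    "The worker will temporarily not accept new jobs until the load is "
    "lower again."
)


def parse_rejection(line):
    """Return details of a load-based rejection, or None if no match."""
    match = REJECTION.search(line)
    if match is None:
        return None
    return {
        "worker": int(match["worker"]),
        "jobs": [int(job) for job in re.findall(r"\d+", match["jobs"])],
        "load": tuple(float(value) for value in match["load"].split()),
        "threshold": float(match["threshold"]),
    }


def count_rejections(lines):
    """Count load-based rejections per worker ID."""
    counts = Counter()
    for line in lines:
        parsed = parse_rejection(line)
        if parsed is not None:
            counts[parsed["worker"]] += 1
    return counts


if __name__ == "__main__":
    # Reads journal output from standard input.
    for worker, count in count_rejections(sys.stdin).most_common():
        print(f"worker {worker}: {count} rejection(s)")
```

Assuming the script is saved as count_rejections.py (a made-up file name), the journalctl command above can be piped into it, e.g. journalctl -u openqa-websockets.service --since "2025-02-07 22:12:51" --until "2025-02-07 22:43:18" | python3 count_rejections.py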
Updated by jbaier_cz 3 months ago
- Related to action #163394: Consider extending our logging of broken workers in grafana (Better understand "Broken workers alert" retroactively) added
Updated by jbaier_cz 3 months ago
I tried to gather some more info, but failed to obtain anything useful. I followed the suggestion from #176763#note-2 and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1369. I feel that in this case I can't do much more and that #163394 could be helpful here.
Updated by jbaier_cz 3 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1374 should handle the wrongly reported value
Updated by jbaier_cz 3 months ago
I recreated the alert and prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1382. That will hopefully provision the alert correctly this time.