action #175836
closed
[alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S
0%
Description
Observation
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-01-15T02:00:38.153Z&to=2025-01-15T20:10:14.812Z&var-host_disks=$__all
&
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-01-18T20:48:41.457Z&to=2025-01-19T14:58:18.116Z&var-host_disks=$__all
This happened during the weekly maintenance window, when OSD is allowed to reboot. Planned automatic reboots should not trigger alerts, and there should be no reports of about 350 broken workers for an hour during a reboot.
Suggestions
Updated by okurz about 1 month ago
- Subject changed from [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot to [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to robert.richardson
Updated by robert.richardson about 1 month ago
- Status changed from Workable to Blocked
I've silenced the alert for now.
Blocking on: https://progress.opensuse.org/issues/162296
Updated by okurz about 1 month ago
robert.richardson wrote in #note-2:
I've silenced the alert for now.
Blocking on: https://progress.opensuse.org/issues/162296
Then please add the corresponding ticket relation.
Updated by gpuliti about 1 month ago
- Blocked by action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Updated by okurz about 1 month ago
- Status changed from Blocked to Workable
- Priority changed from High to Urgent
Multiple alerts fired, so I assume the silence is not effective.
Updated by robert.richardson about 1 month ago
Updated by okurz about 1 month ago
Updated by robert.richardson about 1 month ago · Edited
okurz wrote in #note-8:
robert.richardson wrote in #note-6:
The silence was only active for a very short time period, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?
Well, did the alert trigger the past days or not?
Yes, though not in the same way as the two examples in the description, where hundreds of workers failed over a longer period of time.
So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the timeout to expire shortly before that timeframe. When is the next weekly reboot presumably going to happen, Sunday morning as well?
Updated by okurz about 1 month ago
- Priority changed from Urgent to High
robert.richardson wrote in #note-9:
So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the timeout to expire shortly before that timeframe.
I am not sure what purpose that would serve.
When is the next weekly reboot presumably going to happen, Sunday morning as well?
Yes, if necessary for package upgrades, e.g. kernel upgrade
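For illustration, the expiry discussed above (a silence ending shortly before the next weekly reboot) could be computed roughly as in this minimal sketch. The ticket only says "Sunday morning", so the reboot hour (03:00 UTC), the 30-minute margin, and the function name `silence_expiry` are assumptions made for the example, not values taken from the OSD setup.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only; the ticket only says "Sunday morning".
REBOOT_WEEKDAY = 6               # datetime.weekday(): Monday is 0, Sunday is 6
REBOOT_HOUR_UTC = 3              # hypothetical reboot hour
MARGIN = timedelta(minutes=30)   # hypothetical safety margin before the reboot


def silence_expiry(now: datetime) -> datetime:
    """Return a time shortly before the next assumed weekly reboot."""
    days_ahead = (REBOOT_WEEKDAY - now.weekday()) % 7
    reboot = (now + timedelta(days=days_ahead)).replace(
        hour=REBOOT_HOUR_UTC, minute=0, second=0, microsecond=0
    )
    # If this week's expiry point has already passed, use next week's reboot.
    if reboot - MARGIN <= now:
        reboot += timedelta(days=7)
    return reboot - MARGIN


if __name__ == "__main__":
    print(silence_expiry(datetime.now(timezone.utc)).isoformat())
```

The resulting timestamp would still have to be entered as the end time of the silence by hand or through whatever tooling manages the alert; that step is not shown here.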
Updated by robert.richardson about 1 month ago
- Status changed from Workable to Resolved
Had a short chat with @okurz; we decided to unsilence the alert. Resolving the ticket.