action #175836
closed
[alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S
Added by okurz about 1 month ago.
Updated about 1 month ago.
Category:
Regressions/Crashes
- Subject changed from [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot to [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to robert.richardson
- Status changed from Workable to Blocked
- Blocked by action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
- Status changed from Blocked to Workable
- Priority changed from High to Urgent
Multiple alerts came in, so I assume the silence is not effective.
The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?
- Description updated (diff)
robert.richardson wrote in #note-6:
The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?
Well, did the alert trigger in the past days or not?
okurz wrote in #note-8:
robert.richardson wrote in #note-6:
The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?
Well, did the alert trigger in the past days or not?
Yes, though not in the same way as the two examples in the description, where hundreds of workers failed over a longer period of time.
So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the timeout to expire shortly before that timeframe. When is the next weekly reboot presumably going to happen, Sunday morning as well?
- Priority changed from Urgent to High
robert.richardson wrote in #note-9:
So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the timeout to expire shortly before that timeframe.
I am not sure for what purpose
When is the next weekly reboot presumably going to happen, Sunday morning as well?
Yes, if necessary for package upgrades, e.g. a kernel upgrade.
- Status changed from Workable to Resolved
Had a short chat with @okurz; we decided to unsilence the alert. Resolving the ticket.