action #175836

closed

[alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Category: Regressions/Crashes
Start date: 2025-01-20
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-01-15T02:00:38.153Z&to=2025-01-15T20:10:14.812Z&var-host_disks=$__all
&
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-96&from=2025-01-18T20:48:41.457Z&to=2025-01-19T14:58:18.116Z&var-host_disks=$__all

This happened during the weekly maintenance window, when OSD is allowed to reboot, but planned automatic reboots shouldn't trigger alerts. There shouldn't be reports of about 350 broken workers for an hour during a reboot.

Suggestions

  • Silence and reference #162296
  • Block on #162296
  • Remove silence and ensure we are back to good during weekly OSD reboots

Related issues 1 (0 open, 1 closed)

Blocked by openQA Project (public) - action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S (Resolved, dheidler, 2024-06-14)

Actions #1

Updated by okurz about 1 month ago

  • Subject changed from [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot to [alert][FIRING:1] (Broken workers alert Salt dZ025mf4z) due to osd reboot, many broken workers size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to robert.richardson
Actions #2

Updated by robert.richardson about 1 month ago

  • Status changed from Workable to Blocked

I've silenced the alert for now.

Blocking on: https://progress.opensuse.org/issues/162296

Actions #3

Updated by okurz about 1 month ago

robert.richardson wrote in #note-2:

I've silenced the alert for now.

Blocking on: https://progress.opensuse.org/issues/162296

Then please add the corresponding ticket relation.

Actions #4

Updated by gpuliti about 1 month ago

  • Blocked by action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Actions #5

Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable
  • Priority changed from High to Urgent

Multiple alerts fired, so I assume the silence is not effective.

Actions #6

Updated by robert.richardson about 1 month ago

The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?

Actions #7

Updated by robert.richardson about 1 month ago

  • Description updated (diff)
Actions #8

Updated by okurz about 1 month ago

robert.richardson wrote in #note-6:

The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?

Well, did the alert trigger in the past few days or not?

Actions #9

Updated by robert.richardson about 1 month ago · Edited

okurz wrote in #note-8:

robert.richardson wrote in #note-6:

The silence was only active for a very short time, sorry about that. Does it still make sense to reactivate it now that #162296 and #175956 are resolved?

Well, did the alert trigger in the past few days or not?

Yes, though not in the same way as the two examples in the description, where hundreds of workers failed over a longer period of time.
So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the silence to expire shortly before that timeframe. When is the next weekly reboot presumably going to happen, Sunday morning as well?
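
For illustration only, the idea of letting a silence expire shortly before the next weekly reboot could be sketched as below. This is a rough sketch and not something taken from the ticket: the function name, the 03:00 UTC reboot hour and the 30 minute margin are all assumptions; only "Sunday morning" comes from the discussion here.

```python
from datetime import datetime, timedelta, timezone


def silence_end_before_next_reboot(
    now: datetime,
    reboot_weekday: int = 6,                   # 6 = Sunday (datetime.weekday() convention)
    reboot_hour: int = 3,                      # ASSUMPTION: hypothetical reboot hour, not from the ticket
    margin: timedelta = timedelta(minutes=30)  # ASSUMPTION: hypothetical safety margin
) -> datetime:
    """Return the moment a silence should expire: `margin` before the next reboot."""
    # Start of the reboot window on the next matching weekday (possibly today).
    days_ahead = (reboot_weekday - now.weekday()) % 7
    reboot = (now + timedelta(days=days_ahead)).replace(
        hour=reboot_hour, minute=0, second=0, microsecond=0
    )
    # If the computed expiry is not in the future, target next week's window instead.
    if reboot - margin <= now:
        reboot += timedelta(days=7)
    return reboot - margin


if __name__ == "__main__":
    now = datetime(2025, 1, 22, 12, 0, tzinfo=timezone.utc)  # a Wednesday
    print(silence_end_before_next_reboot(now).isoformat())
```

The returned timestamp could then be used as the end time when creating the silence, so the alert is live again by the time the reboot starts.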

Actions #10

Updated by okurz about 1 month ago

  • Priority changed from Urgent to High

robert.richardson wrote in #note-9:

So I'll reactivate it then, I guess. Looking at the third suggestion, "Remove silence and ensure we are back to good during weekly OSD reboots", I'll set the silence to expire shortly before that timeframe.

I am not sure what purpose that would serve.

When is the next weekly reboot presumably going to happen, Sunday morning as well?

Yes, if necessary for package upgrades, e.g. a kernel upgrade.

Actions #11

Updated by robert.richardson about 1 month ago

  • Status changed from Workable to Resolved

Had a short chat with @okurz; we decided to unsilence the alert. Resolving the ticket.
