action #132437
closed
Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems size:M
Added by okurz over 1 year ago.
Updated about 1 year ago.
Description
Motivation
In the retro on 2023-07-07 we identified that we had "many alerts recently". One point we need to ensure is that everybody in SUSE QE Tools knows how to silence alerts in the various monitoring systems, in particular when alerts relate to work we are currently doing, so that others do not look into those alerts assuming they are unhandled.
Acceptance criteria
- AC1: The majority of SUSE QE Tools knows how to silence alerts in Grafana, Zabbix, GitLab CI, openqa-logwarn, openQA "unknown issues" messages, etc.
Suggestions
- In a common team meeting, go over all systems mentioned in AC1 with the team, show how silencing works, and clarify questions
- As needed extend Alert handling as documented in the wiki or the salt states README
- Subject changed from Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems to Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems size:M
- Description updated (diff)
- Status changed from New to Workable
- Description updated (diff)
- Status changed from Workable to In Progress
- Assignee set to livdywan
The wiki, for example, pretty much assumes Grafana is the only place where we see alerts, so I'll start by adding other alerts here.
livdywan wrote in #note-2:
The wiki, for example, pretty much assumes Grafana is the only place where we see alerts, so I'll start by adding other alerts here.
I'm not actually sure myself how to silence/pause Munin, Zabbix, "Unknown issue" or logwarn alerts. Let's see if others know.
- Due date set to 2023-10-25
Setting due date based on mean cycle time of SUSE QE Tools
- Description updated (diff)
Open questions:
- How should openqa-logwarn issues be silenced?
  - Or maybe we actually document that we don't really do that? Since we might rather address these e.g. by proposing a change to logging in relevant components?
- How should "Unknown issue" emails be silenced?
  - By filing and adding a ticket to the affected job?
- Status changed from In Progress to Feedback
As we realized that using "unknown" and "unreviewed" issues interchangeably is confusing, I'm attempting to rectify that by consistently using one term: either "unreviewed", as in the email subject, or "unknown". Let's see if others have strong opinions on this one: https://github.com/os-autoinst/scripts/pull/266
- Status changed from Feedback to Resolved
Everyone is somewhat comfortable with the updates. And let's remember to update the steps in case we find any gaps later!