action #132437

closed

Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Start date:
2023-07-07
Due date:
2023-10-25
% Done:

0%

Estimated time:
Tags:

Description

Motivation

In the retro on 2023-07-07 we identified that we had "many alerts recently". One point we need to ensure is that everybody in SUSE QE Tools knows how to silence alerts in the various monitoring systems, in particular when alerts are related to work we are currently doing, so that others do not look into alerts on the assumption that they are unhandled.

Acceptance criteria

  • AC1: The majority of SUSE QE Tools knows how to silence alerts in grafana, zabbix, gitlab CI, openqa-logwarn, openQA "unknown issues" messages, etc.

Suggestions

  • In a common team meeting, go over all systems mentioned in AC1 with the team, show how silencing works in each and clarify questions
  • As needed, extend "Alert handling" as documented in the wiki or the salt states README
Actions #1

Updated by okurz over 1 year ago

  • Subject changed from Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems to Ensure everybody in SUSE QE Tools knows how to silence alerts in various monitoring systems size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by livdywan over 1 year ago

  • Description updated (diff)
  • Status changed from Workable to In Progress
  • Assignee set to livdywan

The wiki, for example, pretty much assumes Grafana is the only place where we see alerts, so I'll start by adding other alerts here.

Actions #3

Updated by livdywan over 1 year ago

livdywan wrote in #note-2:

The wiki, for example, pretty much assumes Grafana is the only place where we see alerts, so I'll start by adding other alerts here.

I'm not actually sure myself how to silence or pause Munin, Zabbix, "Unknown issue" or logwarn alerts. Let's see if others know.

Actions #4

Updated by openqa_review over 1 year ago

  • Due date set to 2023-10-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #6

Updated by livdywan over 1 year ago

I added a note on suppressing problems in Zabbix. Unfortunately I couldn't really test it yet.

Actions #7

Updated by livdywan over 1 year ago

Open questions:

  • How should openqa-logwarn issues be silenced?
    • Or maybe we actually document that we don't really do that? Since we might rather address these e.g. by proposing a change to logging in relevant components?
  • How should "Unknown issue" emails be silenced?
    • By filing and adding a ticket to the affected job?
Actions #8

Updated by okurz over 1 year ago

livdywan wrote in #note-7:

Open questions:

  • How should openqa-logwarn issues be silenced?
    • Or maybe we actually document that we don't really do that? Since we might rather address these e.g. by proposing a change to logging in relevant components?

By adding "known issues" to https://github.com/os-autoinst/openqa-logwarn/blob/master/logwarn_openqa

  • How should "Unknown issue" emails be silenced?
    • By filing and adding a ticket to the affected job?

yes
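To illustrate the idea behind "known issues" in openqa-logwarn, here is a minimal sketch of pattern-based suppression: log lines matching a known-issue pattern are dropped, and everything else is still reported. This is not the actual logwarn_openqa script. The patterns and the function name `unhandled_lines` are hypothetical and only serve as an example; the real list of known issues lives in the repository linked above.

```python
import re

# Hypothetical known-issue patterns, for illustration only.
# The real patterns are maintained in the logwarn_openqa script.
KNOWN_ISSUES = [
    re.compile(r"worker \d+ is not responding"),
    re.compile(r"\[warn\] Unable to wakeup scheduler"),
]


def unhandled_lines(log_lines, known_issues=KNOWN_ISSUES):
    """Return only the lines that match none of the known-issue patterns."""
    return [
        line
        for line in log_lines
        if not any(pattern.search(line) for pattern in known_issues)
    ]


if __name__ == "__main__":
    sample = [
        "[warn] Unable to wakeup scheduler: timeout",
        "[error] database connection lost",
    ]
    # Only lines without a matching known issue would trigger an alert.
    for line in unhandled_lines(sample):
        print(line)
```

Under this model, silencing a recurring log message means adding one more pattern to the list, ideally together with a reference to the ticket that tracks the underlying problem.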

Actions #9

Updated by livdywan over 1 year ago

okurz wrote in #note-8:

By adding "known issues" to https://github.com/os-autoinst/openqa-logwarn/blob/master/logwarn_openqa

That's the part I couldn't remember. There isn't a config file :-D We should document this properly.

https://github.com/os-autoinst/openqa-logwarn/pull/46

Actions #10

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Feedback

livdywan wrote in #note-9:

https://github.com/os-autoinst/openqa-logwarn/pull/46

Merged. Now to confirm that the team understands what's documented in our wiki. Definitely had some insightful conversations this week about things that weren't clear before.

Actions #11

Updated by livdywan over 1 year ago

As we realized it's confusing to use "unknown" and "unreviewed" issues interchangeably, I'm attempting to rectify that by consistently using one term: either "unreviewed", as in the email subject, or "unknown". Let's see if others have strong opinions on this one: https://github.com/os-autoinst/scripts/pull/266

Actions #12

Updated by livdywan over 1 year ago

  • Status changed from Feedback to Resolved

Everyone is somewhat comfortable with the updates. And let's remember to update the steps in case we find any gaps later!
