action #131150 (closed)

coordination #132275: [epic] Better o3 monitoring

Add alarms for partition usage on o3 size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-06-20
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

From https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.06/msg00042.html : We have received an alert message from munin about /assets on o3 being 92% full. I think 92% (right now even increased to 93% on o3) is alarming and we should have been notified about this by zabbix, where likely the old alarm thresholds were not migrated. We should ensure that there is sufficient alerting. We could go with munin, but I guess for something as low-level as disk usage zabbix should be easy enough to use.

Acceptance criteria

  • AC1: A SUSE-IT maintained monitoring solution will alert us if /assets exceeds 90% usage
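
For illustration only, here is a minimal Python sketch of the check AC1 describes; the mount point and the 90% threshold come from this ticket, while the script itself is a hypothetical stand-in for whatever trigger the SUSE-IT maintained monitoring solution ends up using:

```python
# Minimal sketch of the check behind AC1: flag /assets above 90% usage.
# The real alerting is expected to come from the SUSE-IT maintained
# monitoring solution (Zabbix); this only illustrates the threshold logic.
import shutil
import sys

MOUNT_POINT = "/assets"   # partition from the ticket
THRESHOLD_PCT = 90        # alert threshold from AC1

usage = shutil.disk_usage(MOUNT_POINT)
used_pct = usage.used / usage.total * 100

if used_pct > THRESHOLD_PCT:
    print(f"ALERT: {MOUNT_POINT} is {used_pct:.1f}% full (threshold {THRESHOLD_PCT}%)")
    sys.exit(1)

print(f"OK: {MOUNT_POINT} is {used_pct:.1f}% full")
```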

Suggestions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 (Resolved, okurz, 2023-07-02)

Related to openQA Infrastructure (public) - action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 (Resolved, jbaier_cz, 2023-07-16)

Copied from openQA Infrastructure (public) - action #131147: Reduce /assets usage on o3 (Resolved, okurz, 2023-06-20)

Actions #1

Updated by okurz over 1 year ago

Actions #2

Updated by okurz over 1 year ago

  • Subject changed from Add alarms for partition usage on o3 to Add alarms for partition usage on o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz over 1 year ago

  • Parent task set to #132275
Actions #4

Updated by okurz over 1 year ago

  • Related to action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 added
Actions #5

Updated by okurz over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

As #132278 is being worked on, let's wait for that first.

Actions #6

Updated by jbaier_cz over 1 year ago

While dealing with the HTTP response alerting, I found out several things:

There is already an SD-127003 ticket requesting access to view this template (which should also make the macro value visible).

Actions #7

Updated by okurz over 1 year ago

  • Related to action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 added
Actions #8

Updated by okurz over 1 year ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)
Actions #9

Updated by livdywan over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

jbaier_cz wrote:

There is already an SD-127003 ticket requesting access to view this template (which should also make the macro value visible).

I'm grabbing the ticket. Will check with Jiri

Actions #10

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Feedback

cdywan wrote:

jbaier_cz wrote:

There is already an SD-127003 ticket requesting access to view this template (which should also make the macro value visible).

I'm grabbing the ticket. Will check with Jiri

So we talked about checking alerts in the web UI. We do get emails like "Problem: /assets Disk space is low (used > 80%)" now. Apparently they used to be silenced for unknown reasons.
Similar to Grafana, it's also possible to filter by severity, timeframe and other fields - the UX seems to assume a massive low-res screen, so it may not be super easy to get into.
We can actually decide which kinds of notifications we get emails for, like critical or higher, or high or higher (think in terms of log levels here). I feel like for now we should monitor what we're getting, since I don't yet have a good idea of what would make sense.
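
To illustrate the severity-based filtering mentioned above, here is a hedged Python sketch that lists open problems at or above a given severity via the Zabbix JSON-RPC API; the URL, credentials and severity choice are placeholders rather than the actual o3 setup, and the exact login parameter name differs between Zabbix versions:

```python
# Hypothetical sketch: list open Zabbix problems at or above a given severity,
# roughly the "high or higher" filtering discussed above. URL and credentials
# are placeholders, not the real instance.
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder

def api_call(method, params, auth=None, req_id=1):
    """Send one JSON-RPC request to the Zabbix API and return its result."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}
    if auth:
        payload["auth"] = auth
    response = requests.post(ZABBIX_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["result"]

# Log in first; older Zabbix versions expect "user" instead of "username" here.
token = api_call("user.login", {"username": "monitoring-ro", "password": "secret"})

# Fetch unresolved problems with severity High (4) or Disaster (5).
problems = api_call("problem.get", {
    "output": ["eventid", "name", "severity", "clock"],
    "severities": [4, 5],
    "sortfield": "eventid",
    "sortorder": "DESC",
}, auth=token)

for problem in problems:
    print(problem["severity"], problem["name"])
```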

Actions #11

Updated by livdywan over 1 year ago

  • Status changed from Feedback to Resolved

There's some overlap with #132815 here. It's basically solved: we have limits configured and can also adjust them (anyone who can't may still need to file an SD ticket), hence closing.
