action #131150
coordination #132275: [epic] Better o3 monitoring (closed)
Add alarms for partition usage on o3 size:M
Description
Motivation
From https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.06/msg00042.html : We have received an alert message from munin about /assets on o3 being 92% full. I think 92% (right now even increased to 93% on o3) is alarming, and we should have been notified about this by zabbix, where the old alarm thresholds were likely not migrated. We should ensure that there is sufficient alerting. We could go with munin, but for something as low-level as disk usage, zabbix should be easy enough to use.
Acceptance criteria
- AC1: A SUSE-IT maintained monitoring solution will alert us if /assets exceeds 90% usage
Suggestions
- Log in to https://zabbix.nue.suse.com/ and play around until you have an alert for o3 partition usage, or ask Eng-Infra to bring back what they likely still store in some of their git repos regarding partition usage alerts from their former icinga/nagios instance
- https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&hostids%5B%5D=10855 shows me, if that link works, two problems, e.g. that the zabbix agent has not been available for months. This might be the first thing to look into
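For reference, the condition AC1 asks the monitoring to alert on boils down to "used space on /assets above 90%". A minimal sketch of that check in Python, assuming it runs on the host where /assets is mounted; this only illustrates the trigger condition and is not part of the SUSE-IT Zabbix setup:

```python
#!/usr/bin/env python3
"""Illustration of the AC1 condition: warn when /assets exceeds 90% usage."""
import shutil
import sys

MOUNTPOINT = "/assets"  # partition named in AC1
THRESHOLD_PCT = 90.0    # alert threshold from AC1

usage = shutil.disk_usage(MOUNTPOINT)
used_pct = 100.0 * usage.used / usage.total

if used_pct > THRESHOLD_PCT:
    print(f"ALERT: {MOUNTPOINT} is {used_pct:.1f}% full (threshold {THRESHOLD_PCT}%)")
    sys.exit(1)
print(f"OK: {MOUNTPOINT} is {used_pct:.1f}% full")
```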
Updated by okurz over 1 year ago
- Copied from action #131147: Reduce /assets usage on o3 added
Updated by okurz over 1 year ago
- Subject changed from Add alarms for partition usage on o3 to Add alarms for partition usage on o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Related to action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 added
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
- Assignee set to okurz
As #132278 is being worked on, let's wait for that first.
Updated by jbaier_cz over 1 year ago
While dealing with the HTTP response alerting, I found out several things:
- The needed item is monitored: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=323610;
- notifications in general should be covered by #132278;
- corresponding triggers are provided by a (so far inaccessible) template, together with a macro $VFS.FS.PUSED.MAX.WARN which contains the threshold for the trigger.
There is already an SD-127003 ticket requesting access to view this template (which should also make the macro value visible).
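Once that template is viewable, the threshold could presumably also be read programmatically via the Zabbix JSON-RPC API. A hedged sketch, assuming the standard api_jsonrpc.php endpoint on zabbix.nue.suse.com, an API token with read access, the usermacro.get method, the braced Zabbix user-macro syntax for the macro named above, and the o3 host id 10855 from the problem-view link in the suggestions; parameter details vary between Zabbix versions:

```python
"""Sketch: read the VFS.FS.PUSED.MAX.WARN warning threshold via the Zabbix API."""
import requests

ZABBIX_API = "https://zabbix.nue.suse.com/api_jsonrpc.php"  # assumed default API path
API_TOKEN = "..."  # placeholder: API token or session id from user.login

payload = {
    "jsonrpc": "2.0",
    "method": "usermacro.get",
    "params": {
        "output": ["macro", "value"],
        "hostids": ["10855"],  # o3, as referenced in the problem-view link above
        "filter": {"macro": "{$VFS.FS.PUSED.MAX.WARN}"},  # macro from the comment above
    },
    "auth": API_TOKEN,  # older-style auth field; newer Zabbix prefers a Bearer header
    "id": 1,
}

response = requests.post(ZABBIX_API, json=payload, timeout=30)
response.raise_for_status()
for macro in response.json().get("result", []):
    print(f"{macro['macro']} = {macro['value']}")  # threshold in percent
```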
Updated by okurz over 1 year ago
- Related to action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 added
Updated by okurz over 1 year ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
jbaier_cz wrote:
There is already a SD-127003 ticket requesting the access to view this template (which should also make the macro value visible).
I'm grabbing the ticket. Will check with Jiri
Updated by livdywan over 1 year ago
- Status changed from In Progress to Feedback
cdywan wrote:
jbaier_cz wrote:
There is already a SD-127003 ticket requesting the access to view this template (which should also make the macro value visible).
I'm grabbing the ticket. Will check with Jiri
So we talked about checking alerts in the web UI. We do get emails like "Problem: /assets Disk space is low (used > 80%)" now. Apparently they used to be silenced for unknown reasons.
Similar to Grafana, it's also possible to filter by severity, timeframe and other fields. The UX seems to assume a massive low-res screen, so it may not be super easy to get into.
We can actually decide what kind of notifications we get emails for, like critical or higher, or high or higher (think in terms of log levels here). I feel like for now we should monitor what we're getting, since I don't have a good idea yet of what would make sense.
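To make the "X or higher" idea concrete: Zabbix ranks problems on a fixed severity scale (Not classified, Information, Warning, Average, High, Disaster), and notifications can be limited to everything at or above a chosen level, much like log levels. A tiny illustrative sketch of that filtering; the cut-off and the sample calls below are hypothetical:

```python
# Illustration of "severity X or higher" filtering; the severity names and order
# are the standard Zabbix ones, the cut-off below is a made-up example.
SEVERITIES = ["Not classified", "Information", "Warning", "Average", "High", "Disaster"]

def should_mail(problem_severity: str, minimum: str = "Average") -> bool:
    """Return True if an email notification should go out for this severity."""
    return SEVERITIES.index(problem_severity) >= SEVERITIES.index(minimum)

print(should_mail("Warning"))  # False: below the example cut-off
print(should_mail("High"))     # True: e.g. a /assets disk-space problem
```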
Updated by livdywan over 1 year ago
- Status changed from Feedback to Resolved
There's some overlap with #132815 here. It's basically solved: we have limits configured and can also adjust them (anyone who can't may still need to file an SD ticket), hence closing.