action #131150
coordination #132275: [epic] Better o3 monitoring (closed)
Add alarms for partition usage on o3 size:M
Description
Motivation
From https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.06/msg00042.html : We have received an alert message from munin about /assets on o3 being 92% full. I think 92% (right now even increased to 93% on o3) is alarming, and we should have been notified about this by zabbix, where the old alarm thresholds were likely not migrated. We should ensure that there is sufficient alerting. We could go with munin, but for something as low-level as disk usage, zabbix should be easy enough to use.
Acceptance criteria
- AC1: A SUSE-IT maintained monitoring solution will alert us if /assets exceeds 90% usage
Suggestions
- Log in to https://zabbix.nue.suse.com/ and play around until you have an alert for o3 partition usage, or ask Eng-Infra to bring back what they likely still store in some of their git repos regarding partition usage alerts from their former icinga/nagios instance
- https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view&hostids%5B%5D=10855 shows me, if that link works, two problems, e.g. that the zabbix agent has not been available for months. This might be the first thing to look into
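For reference, the condition AC1 asks the monitoring to alert on boils down to "used space on /assets above 90%". A minimal sketch of that check in Python, assuming it runs on the host where /assets is mounted; this only illustrates the trigger condition and is not part of the SUSE-IT Zabbix setup:

```python
#!/usr/bin/env python3
"""Illustration of the AC1 condition: warn when /assets exceeds 90% usage."""
import shutil
import sys

MOUNTPOINT = "/assets"  # partition named in AC1
THRESHOLD_PCT = 90.0    # alert threshold from AC1

usage = shutil.disk_usage(MOUNTPOINT)
used_pct = 100.0 * usage.used / usage.total

if used_pct > THRESHOLD_PCT:
    print(f"ALERT: {MOUNTPOINT} is {used_pct:.1f}% full (threshold {THRESHOLD_PCT}%)")
    sys.exit(1)
print(f"OK: {MOUNTPOINT} is {used_pct:.1f}% full")
```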
Updated by okurz over 1 year ago
- Copied from action #131147: Reduce /assets usage on o3 added
Updated by okurz over 1 year ago
- Subject changed from Add alarms for partition usage on o3 to Add alarms for partition usage on o3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Related to action #132218: Conduct lessons learned for "openQA is not accessible" on 2023-07-02 added
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
- Assignee set to okurz
As #132278 is being worked on, let's wait for that first.
Updated by jbaier_cz over 1 year ago
While dealing with the HTTP response alerting, I found out several things:
- The needed item is monitored: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=323610;
- notifications in general should be covered by #132278;
- corresponding triggers are provided by a (so far inaccessible) template, together with a macro $VFS.FS.PUSED.MAX.WARN which contains the threshold for the trigger.
There is already an SD-127003 ticket requesting access to view this template (which should also make the macro value visible).
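Once that template is viewable, the threshold could presumably also be read programmatically via the Zabbix JSON-RPC API. A hedged sketch, assuming the standard api_jsonrpc.php endpoint on zabbix.nue.suse.com, an API token with read access, the usermacro.get method, the braced Zabbix user-macro syntax for the macro named above, and the o3 host id 10855 from the problem-view link in the suggestions; parameter details vary between Zabbix versions:

```python
"""Sketch: read the VFS.FS.PUSED.MAX.WARN warning threshold via the Zabbix API."""
import requests

ZABBIX_API = "https://zabbix.nue.suse.com/api_jsonrpc.php"  # assumed default API path
API_TOKEN = "..."  # placeholder: API token or session id from user.login

payload = {
    "jsonrpc": "2.0",
    "method": "usermacro.get",
    "params": {
        "output": ["macro", "value"],
        "hostids": ["10855"],  # o3, as referenced in the problem-view link above
        "filter": {"macro": "{$VFS.FS.PUSED.MAX.WARN}"},  # macro from the comment above
    },
    "auth": API_TOKEN,  # older-style auth field; newer Zabbix prefers a Bearer header
    "id": 1,
}

response = requests.post(ZABBIX_API, json=payload, timeout=30)
response.raise_for_status()
for macro in response.json().get("result", []):
    print(f"{macro['macro']} = {macro['value']}")  # threshold in percent
```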
Updated by okurz over 1 year ago
- Related to action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 added
Updated by okurz over 1 year ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
jbaier_cz wrote:
There is already a SD-127003 ticket requesting the access to view this template (which should also make the macro value visible).
I'm grabbing the ticket. Will check with Jiri
Updated by livdywan over 1 year ago
- Status changed from In Progress to Feedback
cdywan wrote:
jbaier_cz wrote:
There is already a SD-127003 ticket requesting the access to view this template (which should also make the macro value visible).
I'm grabbing the ticket. Will check with Jiri
So we talked about checking alerts in the web UI. We do get emails like "Problem: /assets Disk space is low (used > 80%)" now. Apparently they used to be silenced for unknown reasons.
Similar to Grafana, it's also possible to filter by severity, timeframe and other fields. The UX seems to assume a massive low-res screen, so it may not be super easy to get into.
We can actually decide what kind of notifications we get emails for, like critical or higher, or high or higher (think in terms of log levels here). I feel like for now we should monitor what we're getting, since I don't have a good idea yet of what would make sense.
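To make the "X or higher" idea concrete: Zabbix ranks problems on a fixed severity scale (Not classified, Information, Warning, Average, High, Disaster), and notifications can be limited to everything at or above a chosen level, much like log levels. A tiny illustrative sketch of that filtering; the cut-off and the sample calls below are hypothetical:

```python
# Illustration of "severity X or higher" filtering; the severity names and order
# are the standard Zabbix ones, the cut-off below is a made-up example.
SEVERITIES = ["Not classified", "Information", "Warning", "Average", "High", "Disaster"]

def should_mail(problem_severity: str, minimum: str = "Average") -> bool:
    """Return True if an email notification should go out for this severity."""
    return SEVERITIES.index(problem_severity) >= SEVERITIES.index(minimum)

print(should_mail("Warning"))  # False: below the example cut-off
print(should_mail("High"))     # True: e.g. a /assets disk-space problem
```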
Updated by livdywan over 1 year ago
- Status changed from Feedback to Resolved
There's some overlap with #132815 here. It's basically solved: we have limits configured and can also adjust them (anyone who can't may still need to file an SD ticket), hence closing.