action #174316 (closed)
[o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S

Added by okurz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

From https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view

it shows

2024-12-11 06:50:26                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /var/tmp: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /var/tmp
2024-12-11 06:50:23                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /

which should be handled in #174313, but I don't recall that we have received any alert notification, e.g. an email to o3-admins@suse.de. Ensure we get emails on alerts.

Suggestions

  • Look into zabbix configuration options (see the API sketch after this list)
  • Check if we would only get emails for critical, not "warning"
  • Crosscheck if this is a regression or if we never got emails
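
As a starting point for the first two suggestions, here is a minimal sketch of reading the trigger-action configuration through the Zabbix JSON-RPC API. The endpoint URL, the token handling and the idea that a single trigger action is responsible for the o3 e-mails are assumptions; only action.get itself is a standard API method, and older Zabbix versions pass the token in the request body instead of a Bearer header.

import json
import urllib.request

# Assumed values: adjust to the real instance and an API token with read access.
ZABBIX_URL = "https://zabbix.nue.suse.com/api_jsonrpc.php"
ZABBIX_TOKEN = "<api token>"

def api(method, params):
    """Minimal JSON-RPC call; newer Zabbix versions accept the token as a Bearer header."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    req = urllib.request.Request(
        ZABBIX_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {ZABBIX_TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

# eventsource 0 = trigger actions; the filter conditions show from which
# severity on notifications are generated (conditiontype 4 = trigger severity).
for action in api("action.get", {"output": "extend",
                                 "selectFilter": "extend",
                                 "selectOperations": "extend",
                                 "filter": {"eventsource": 0}}):
    print(action["name"], "enabled" if action["status"] == "0" else "disabled")
    for cond in action["filter"]["conditions"]:
        print("  condition type", cond["conditiontype"],
              "operator", cond["operator"], "value", cond["value"])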

Related issues: 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M (Resolved, okurz, 2018-08-23)

Related to openQA Infrastructure (public) - action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size:S (Resolved, gpuliti, 2024-12-31, due 2025-01-25)

Related to openQA Infrastructure (public) - action #175210: [o3][zabbix] reconsider e-mail notification settings size:S (Resolved, robert.richardson, 2024-12-12)

Copied from openQA Infrastructure (public) - action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S (Resolved, mkittler)
Actions #1

Updated by okurz 3 months ago

  • Copied from action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S added
Actions #2

Updated by okurz 3 months ago

  • Related to action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M added
Actions #3

Updated by okurz 3 months ago

  • Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? to [o3][zabbix][alert] warning about depleting storage space but no email? size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #5

Updated by gpuliti 2 months ago · Edited

I've checked /var/tmp and everything seems fine:

ls -lahR in /var/tmp:

ariel:/var/tmp # ls -lahR
.:
total 60K
drwxrwxrwt  7 root root  36K Dec 31 14:31 .
drwxr-xr-x 11 root root 4.0K Oct 15 18:14 ..
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL

./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

However, Zabbix reports something different:

ariel.dmz-prg2.suse.org /var/tmp: Used space    4s  9.27 GB
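
One possible explanation for the mismatch is that the Zabbix item measures the filesystem that /var/tmp lives on, not the sum of the directory entries listed above. A quick cross-check could look like the sketch below (whether /var/tmp is a separate mount on ariel is not verified here; the paths are just the ones from this comment):

import os
import shutil

# What the filesystem containing /var/tmp reports -- roughly what a
# vfs.fs.size-style item measures.
total, used, free = shutil.disk_usage("/var/tmp")
print(f"filesystem used: {used / 2**30:.2f} GiB ({used / total:.0%} of {total / 2**30:.2f} GiB)")

# Sum of the visible files under /var/tmp (roughly what ls -lahR shows).
# A large gap points at deleted-but-still-open files or at /var/tmp simply
# sharing the root filesystem.
visible = 0
for root, _dirs, files in os.walk("/var/tmp"):
    for name in files:
        try:
            visible += os.path.getsize(os.path.join(root, name))
        except OSError:
            pass
print(f"visible files under /var/tmp: {visible / 2**30:.2f} GiB")
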
Actions #6

Updated by livdywan about 2 months ago

  • Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? size:S to [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
  • Priority changed from Normal to High

Let's raise prio and block both #174916 and #174313 since this makes the current settings mostly unhelpful. We need to ensure we are seeing alerts here.

Actions #7

Updated by jbaier_cz about 2 months ago

  • Assignee set to jbaier_cz
Actions #8

Updated by jbaier_cz about 2 months ago

  • Status changed from Workable to In Progress
Actions #9

Updated by jbaier_cz about 2 months ago

  • Status changed from In Progress to Resolved

I can confirm e-mail alerts are working as already documented in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Monitoring

The current configuration is to receive an e-mail for a problem if:

  1. The problem has been present for at least 15 minutes
  2. The problem severity is at least Average (i.e. Average, High or Critical); we are not informed about Information and Warning via e-mail

If we want to reevaluate those rules, an SD ticket is needed to adjust the settings; see #132752 for additional info.

Note: for the /var/tmp problem in comment 5 see #174313#note-16
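
Expressed as a small executable sketch of the two rules above (the severity names and the 15-minute delay are taken from this comment; nothing here is read from the actual Zabbix configuration):

# Notification rules as documented above, expressed as a tiny predicate.
SEVERITY_RANK = {"Not classified": 0, "Information": 1, "Warning": 2,
                 "Average": 3, "High": 4, "Critical": 5}

def email_expected(severity: str, problem_age_minutes: float) -> bool:
    """E-mail only if the problem is at least Average and older than 15 minutes."""
    return (SEVERITY_RANK[severity] >= SEVERITY_RANK["Average"]
            and problem_age_minutes >= 15)

# The / and /var/tmp problems from the description are only "Warning",
# so no e-mail is expected no matter how long they stay open:
print(email_expected("Warning", 24 * 60))   # False
print(email_expected("Average", 20))        # True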

Actions #10

Updated by tinita about 2 months ago · Edited

  • Status changed from Resolved to Feedback

The problem with the load average is a bit more complicated though.
We got some emails that the load is over 4, but looking at the graphs the load is over 4 very often and sometimes went over 40.
As the thresholds for warning and critical are currently 4 and 5, we would expect emails all the time. That's why we thought there might be a sporadic email problem.

Actions #11

Updated by jbaier_cz about 2 months ago

  • Related to action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S added
Actions #12

Updated by jbaier_cz about 2 months ago

  • Status changed from Feedback to Resolved

I see, that's probably something we should handle in #174916. The question at hand can be answered quite easily: the trigger causing the e-mail to be sent is "Load average is too high", which has severity Average. It uses the macro LOAD_AVG_PER_CPU.MAX.WARN (current value 4); there is also LOAD_AVG_PER_CPU.MAX.CRIT (current value 5), which is currently not used in any trigger.

So there is really only one threshold which will trigger the problem and eventually send the e-mail. Please note that the load needs to be over 40 for 5 minutes to trigger a problem (and then stay for an additional 15 minutes to trigger the e-mail notification). The threshold is 4 and the machine has 10 cores, hence 40 is the magic number for the load. The e-mail subject also states it is "per CPU load" over 4 (i.e. over 40 in our case).
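
The arithmetic in a runnable form (LOAD_AVG_PER_CPU.MAX.WARN = 4 and the 10 cores are taken from this comment; the exact trigger expression on the server is not reproduced here):

import os

LOAD_AVG_PER_CPU_MAX_WARN = 4          # macro value quoted above
cpu_count = os.cpu_count() or 1        # 10 on ariel according to this comment

# Absolute load that has to be sustained for 5 minutes before the problem fires,
# plus another 15 minutes before the e-mail goes out.
absolute_threshold = LOAD_AVG_PER_CPU_MAX_WARN * cpu_count
load1, load5, load15 = os.getloadavg()
print(f"current per-CPU load: {load1 / cpu_count:.2f}")
print(f"problem fires only above a per-CPU load of {LOAD_AVG_PER_CPU_MAX_WARN}, "
      f"i.e. an absolute load of {absolute_threshold} on this machine")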

Actions #13

Updated by jbaier_cz about 2 months ago

  • Related to action #175210: [o3][zabbix] reconsider e-mail notification settings size:S added