action #174316
closed
[o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
Added by okurz 3 months ago.
Updated about 2 months ago.
Category: Regressions/Crashes
- Copied from action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S added
- Related to action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M added
- Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? to [o3][zabbix][alert] warning about depleting storage space but no email? size:S
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Normal
I've checked /var/tmp, but everything seems fine. Output of ls -lahR in /var/tmp:
ariel:/var/tmp # ls -lahR
.:
total 60K
drwxrwxrwt 7 root root 36K Dec 31 14:31 .
drwxr-xr-x 11 root root 4.0K Oct 15 18:14 ..
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL
./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
Zabbix, however, reports a different picture:
ariel.dmz-prg2.suse.org /var/tmp: Used space 4s 9.27 GB
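One way to cross-check the two numbers is to compare the size of the files visible under the directory with the used space of the filesystem that holds it. The following is only a sketch: it assumes the Zabbix item reports filesystem-level used space for the mount containing /var/tmp, which is not confirmed in this ticket, and the helper names tree_size_bytes and filesystem_used_bytes are made up for illustration.

```python
import os
import shutil


def tree_size_bytes(root):
    """Sum the apparent sizes of regular files below root (symlinks are skipped)."""
    total = 0
    # Unreadable directories are silently skipped; run as root for full coverage.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished or is not readable; ignore it.
                pass
    return total


def filesystem_used_bytes(path):
    """Used bytes of the whole filesystem that contains path."""
    return shutil.disk_usage(path).used


if __name__ == "__main__":
    target = "/var/tmp"
    print("files visible under", target, ":", tree_size_bytes(target), "bytes")
    print("filesystem used space    :", filesystem_used_bytes(target), "bytes")
```

If the second number is much larger than the first, the used space is not coming from files visible in the directory listing.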
- Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? size:S to [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
- Priority changed from Normal to High
Let's raise prio and block both #174916 and #174313 since this makes the current settings mostly unhelpful. We need to ensure we are seeing alerts here.
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
- Status changed from In Progress to Resolved
I can confirm e-mail alerts are working as already documented in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Monitoring
The current configuration is to receive an e-mail for a problem if:
- The problem has persisted for at least 15 minutes
- The problem severity is at least Average (i.e. Average, High or Critical); we are not informed about Information and Warning via e-mail
If we want to reevaluate those rules, an SD ticket is needed to adjust the settings; see #132752 for additional info.
Note: for the /var/tmp problem in comment 5, see #174313#note-16
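The e-mail rule described above (problem present for at least 15 minutes, severity at least Average) can be sketched as a small predicate. This is an illustration of the rule as stated in this comment, not the actual Zabbix action configuration; the function name should_send_email and the severity list are assumptions taken from the wording here.

```python
from datetime import timedelta

# Severity names as listed in this comment, lowest to highest.
SEVERITY_ORDER = ["Information", "Warning", "Average", "High", "Critical"]
MIN_SEVERITY = "Average"              # lowest severity that is e-mailed
MIN_DURATION = timedelta(minutes=15)  # problem must persist at least this long


def should_send_email(severity, problem_age):
    """Return True if a problem of this severity and age would be e-mailed."""
    severe_enough = SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(MIN_SEVERITY)
    old_enough = problem_age >= MIN_DURATION
    return severe_enough and old_enough


if __name__ == "__main__":
    print(should_send_email("Warning", timedelta(hours=2)))     # False: severity too low
    print(should_send_email("Average", timedelta(minutes=5)))   # False: not yet 15 minutes
    print(should_send_email("Average", timedelta(minutes=20)))  # True
```

Under this rule a Warning-level disk space alert never produces an e-mail, no matter how long it stays open.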
- Status changed from Resolved to Feedback
The problem with the load average is a bit more complicated though.
We got some emails that the load is over 4, but looking at the graphs the load is over 4 very often, and was going over 40 sometimes.
As the threshold for warning and critical is currently 4 and 5, we would expect emails all the time. That's why we thought there might be a sporadic email problem.
- Related to action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S added
- Status changed from Feedback to Resolved
I see, that's probably something we should handle in #174916. The question itself can be answered quite easily. The trigger which causes the e-mail to be sent is "Load average is too high", which has severity Average. It is controlled by the macro LOAD_AVG_PER_CPU.MAX.WARN (current value 4); there is also LOAD_AVG_PER_CPU.MAX.CRIT (current value 5), which is currently not used in any trigger.
So there is really only one threshold that triggers the problem and eventually sends the e-mail. Please note that the load needs to be over 40 for 5 minutes to trigger a problem (and then for an additional 15 minutes to trigger the e-mail notification). The threshold is 4 and the machine has 10 cores, hence 40 is the magic number for the load. The e-mail subject also states that it is "per CPU load" over 4 (i.e. over 40 in our case).
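The arithmetic above can be written out as a short sketch. The values (per-CPU threshold 4, 10 cores, 5 minutes, 15 minutes) come from this comment; the function names and the "every sample in the window must exceed the threshold" reading of "over 40 for 5 minutes" are assumptions for illustration, not the actual trigger expression.

```python
WARN_PER_CPU = 4.0        # value of LOAD_AVG_PER_CPU.MAX.WARN per this comment
CPU_COUNT = 10            # cores of the machine per this comment
TRIGGER_MINUTES = 5       # load must stay high this long to open a problem
EMAIL_DELAY_MINUTES = 15  # problem must then persist this long before an e-mail


def absolute_load_threshold(per_cpu_threshold, cpu_count):
    """Convert a per-CPU load threshold into an absolute load value."""
    return per_cpu_threshold * cpu_count


def trigger_fires(load_samples, per_cpu_threshold=WARN_PER_CPU, cpu_count=CPU_COUNT):
    """Assumed reading: every sample in the 5-minute window is over the threshold."""
    if not load_samples:
        return False
    return min(load_samples) / cpu_count > per_cpu_threshold


if __name__ == "__main__":
    print(absolute_load_threshold(WARN_PER_CPU, CPU_COUNT))  # 40.0
    print(trigger_fires([12, 25, 38]))   # False: well over 4, but under 40
    print(trigger_fires([41, 44, 52]))   # True
    print(TRIGGER_MINUTES + EMAIL_DELAY_MINUTES)  # 20 minutes of sustained load before an e-mail
```

This also explains the earlier observation: a load that is often over 4 but only occasionally over 40 rarely stays above the threshold long enough to produce an e-mail.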
- Related to action #175210: [o3][zabbix] reconsider e-mail notification settings size:S added