action #174316
closed[o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
Description
Observation
Zabbix shows:
2024-12-11 06:50:26 Warning PROBLEM ariel.dmz-prg2.suse.org /var/tmp: Disk space is low and might be full in 7d (used > 85%) 1d 9h 40m No Application: Filesystem /var/tmp
2024-12-11 06:50:23 Warning PROBLEM ariel.dmz-prg2.suse.org /: Disk space is low and might be full in 7d (used > 85%) 1d 9h 40m No Application: Filesystem /
which should be handled in #174313, but I don't recall that we received any alert notification, e.g. an email to o3-admins@suse.de. Ensure we get emails on alerts.
Suggestions
- Look into zabbix configuration options
- Check whether we only get emails for "critical", not "warning"
- Crosscheck if this is a regression or if we never got emails
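For the first suggestion, one way to inspect the Zabbix configuration is its JSON-RPC API. A minimal sketch of the request that would list the configured e-mail media types (the URL is a placeholder and the actual query requires a valid API token):

```shell
# Build the JSON-RPC request for mediatype.get, filtered to type 0 (e-mail).
# ZABBIX_URL is a placeholder -- substitute the real server.
ZABBIX_URL="https://zabbix.example.suse.org/api_jsonrpc.php"  # placeholder
payload='{
  "jsonrpc": "2.0",
  "method": "mediatype.get",
  "params": {"filter": {"type": 0}},
  "id": 1
}'
echo "$payload"
# To actually send it (needs a valid API token in $TOKEN):
# curl -s -H "Content-Type: application/json" \
#      -H "Authorization: Bearer $TOKEN" \
#      -d "$payload" "$ZABBIX_URL"
```

Checking `action.get` the same way would show whether any action actually sends e-mail for warning-level triggers.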
Updated by okurz 3 months ago
- Copied from action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S added
Updated by okurz 3 months ago
- Related to action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M added
Updated by gpuliti 2 months ago · Edited
I've checked /var/tmp and it seems everything is fine. Output of ls -lahR in /var/tmp:
ariel:/var/tmp # ls -lahR
.:
total 60K
drwxrwxrwt 7 root root 36K Dec 31 14:31 .
drwxr-xr-x 11 root root 4.0K Oct 15 18:14 ..
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5
drwx------ 3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL
./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root 36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp
./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..
In contrast, Zabbix reports:
ariel.dmz-prg2.suse.org /var/tmp: Used space 4s 9.27 GB
Updated by livdywan about 2 months ago
- Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? size:S to [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
- Priority changed from Normal to High
Updated by jbaier_cz about 2 months ago
- Status changed from Workable to In Progress
Updated by jbaier_cz about 2 months ago
- Status changed from In Progress to Resolved
I can confirm e-mail alerts are working as already documented in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Monitoring
The current configuration is to receive an e-mail for a problem if:
- The problem has existed for at least 15 minutes
- The problem severity is at least Average (i.e. Average, High or Critical); we are not informed about Information and Warning via e-mail
If we want to reevaluate those rules, an SD ticket is needed to adjust the settings; see #132752 for additional info.
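The two rules above can be sketched as a small shell check (the function name is illustrative; severity numbers follow the usual Zabbix scale, where 2 = Warning, 3 = Average, 4 = High):

```shell
# Decide whether the documented rules would send an e-mail for a problem,
# given its severity level and how long it has existed in minutes.
should_mail() {
  local severity=$1 minutes=$2
  if [ "$severity" -ge 3 ] && [ "$minutes" -ge 15 ]; then
    echo "send e-mail"
  else
    echo "no e-mail"
  fi
}
should_mail 2 120   # Warning for 2h  -> no e-mail (severity too low)
should_mail 3 20    # Average for 20m -> send e-mail
should_mail 4 5     # High for 5m     -> no e-mail (not 15 minutes yet)
```

This matches the observation in the ticket: the /var/tmp disk-space trigger is only a Warning, so no e-mail is ever sent for it.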
Note: for the /var/tmp problem in comment 5 see #174313#note-16
Updated by tinita about 2 months ago · Edited
- Status changed from Resolved to Feedback
The problem with the load average is a bit more complicated, though.
We got some emails that the load was over 4, but looking at the graphs, the load is over 4 very often and sometimes went over 40.
As the thresholds for warning and critical are currently 4 and 5, we would expect emails all the time. That's why we thought there might be a sporadic email problem.
Updated by jbaier_cz about 2 months ago
- Related to action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S added
Updated by jbaier_cz about 2 months ago
- Status changed from Feedback to Resolved
I see, that's probably something we should handle in #174916. The problem in question can be answered quite easily. The trigger causing the e-mail to be sent is "Load average is too high", which has severity Average. It is triggered by the macro LOAD_AVG_PER_CPU.MAX.WARN (the current value is 4); there is also LOAD_AVG_PER_CPU.MAX.CRIT (current value 5), which is currently not used in any trigger.
So there is really only one threshold that will trigger the problem and eventually send the e-mail. Please note the load needs to be over 40 for 5 minutes to trigger a problem (and then for an additional 15 minutes to trigger the e-mail notification). The threshold is 4 and the machine has 10 cores, hence 40 is the magic number for the load. The e-mail subject also states it is "per CPU load" over 4 (i.e. over 40 in our case).
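The arithmetic above, as a quick worked example (numbers taken from this comment; the sample load value is made up for illustration):

```shell
# Absolute load threshold = per-CPU macro value * number of cores.
per_cpu_warn=4          # {$LOAD_AVG_PER_CPU.MAX.WARN}
cores=10                # CPU count on ariel per this comment
threshold=$((per_cpu_warn * cores))
echo "absolute load threshold: $threshold"

# Would a sampled 1-minute load average fire the trigger?
load=12.5               # example value, well over 4 but under 40
awk -v l="$load" -v t="$threshold" \
  'BEGIN { if (l > t) print "TRIGGER"; else print "ok" }'
```

This explains tinita's observation: a load "over 4 very often" never fires the trigger, because only the absolute value of 40 counts.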
Updated by jbaier_cz about 2 months ago
- Related to action #175210: [o3][zabbix] reconsider e-mail notification settings size:S added