action #174316 (closed)
[o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S

Added by okurz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

From https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view

it shows

2024-12-11 06:50:26                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /var/tmp: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /var/tmp
2024-12-11 06:50:23                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /

which should be handled in #174313, but I don't recall that we have received any alert notification, e.g. an email to o3-admins@suse.de. Ensure we get emails on alerts.

Suggestions

  • Look into zabbix configuration options (see the API sketch after this list)
  • Check if we would only get emails for critical, not "warning"
  • Crosscheck if this is a regression or if we never got emails
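
As a starting point for the first two suggestions, here is a minimal sketch of reading the trigger-action configuration through the Zabbix JSON-RPC API. The endpoint URL, the token handling and the idea that a single trigger action is responsible for the o3 e-mails are assumptions; only action.get itself is a standard API method, and older Zabbix versions pass the token in the request body instead of a Bearer header.

import json
import urllib.request

# Assumed values: adjust to the real instance and an API token with read access.
ZABBIX_URL = "https://zabbix.nue.suse.com/api_jsonrpc.php"
ZABBIX_TOKEN = "<api token>"

def api(method, params):
    """Minimal JSON-RPC call; newer Zabbix versions accept the token as a Bearer header."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    req = urllib.request.Request(
        ZABBIX_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {ZABBIX_TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

# eventsource 0 = trigger actions; the filter conditions show from which
# severity on notifications are generated (conditiontype 4 = trigger severity).
for action in api("action.get", {"output": "extend",
                                 "selectFilter": "extend",
                                 "selectOperations": "extend",
                                 "filter": {"eventsource": 0}}):
    print(action["name"], "enabled" if action["status"] == "0" else "disabled")
    for cond in action["filter"]["conditions"]:
        print("  condition type", cond["conditiontype"],
              "operator", cond["operator"], "value", cond["value"])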

Related issues: 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M (Resolved, okurz, 2018-08-23)

Related to openQA Infrastructure (public) - action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size:S (Resolved, gpuliti, 2024-12-31, due 2025-01-25)

Related to openQA Infrastructure (public) - action #175210: [o3][zabbix] reconsider e-mail notification settings size:S (Resolved, robert.richardson, 2024-12-12)

Copied from openQA Infrastructure (public) - action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S (Resolved, mkittler)
Actions #1

Updated by okurz 3 months ago

  • Copied from action #174313: [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S added
Actions #2

Updated by okurz 3 months ago

  • Related to action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M added
Actions #3

Updated by okurz 3 months ago

  • Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? to [o3][zabbix][alert] warning about depleting storage space but no email? size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #5

Updated by gpuliti 2 months ago · Edited

I've checked /var/tmp and everything seems fine:

ls -lahR in /var/tmp:

ariel:/var/tmp # ls -lahR
.:
total 60K
drwxrwxrwt  7 root root  36K Dec 31 14:31 .
drwxr-xr-x 11 root root 4.0K Oct 15 18:14 ..
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5
drwx------  3 root root 4.0K Dec 15 03:36 systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL

./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-chronyd.service-m8x65h/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-nginx.service-OsXysR/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-rsyncd.service-zS5ceg/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-systemd-logind.service-ORBal5/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL:
total 44K
drwx------ 3 root root 4.0K Dec 15 03:36 .
drwxrwxrwt 7 root root  36K Dec 31 14:31 ..
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 tmp

./systemd-private-6fa2a5564db14377b012191eb1fdf045-zabbix_agentd.service-VqUirL/tmp:
total 8.0K
drwxrwxrwt 2 root root 4.0K Dec 15 03:36 .
drwx------ 3 root root 4.0K Dec 15 03:36 ..

However, Zabbix reports something different:

ariel.dmz-prg2.suse.org /var/tmp: Used space    4s  9.27 GB
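
One possible explanation for the mismatch is that the Zabbix item measures the filesystem that /var/tmp lives on, not the sum of the directory entries listed above. A quick cross-check could look like the sketch below (whether /var/tmp is a separate mount on ariel is not verified here; the paths are just the ones from this comment):

import os
import shutil

# What the filesystem containing /var/tmp reports -- roughly what a
# vfs.fs.size-style item measures.
total, used, free = shutil.disk_usage("/var/tmp")
print(f"filesystem used: {used / 2**30:.2f} GiB ({used / total:.0%} of {total / 2**30:.2f} GiB)")

# Sum of the visible files under /var/tmp (roughly what ls -lahR shows).
# A large gap points at deleted-but-still-open files or at /var/tmp simply
# sharing the root filesystem.
visible = 0
for root, _dirs, files in os.walk("/var/tmp"):
    for name in files:
        try:
            visible += os.path.getsize(os.path.join(root, name))
        except OSError:
            pass
print(f"visible files under /var/tmp: {visible / 2**30:.2f} GiB")
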
Actions #6

Updated by livdywan about 2 months ago

  • Subject changed from [o3][zabbix][alert] warning about depleting storage space but no email? size:S to [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S
  • Priority changed from Normal to High

Let's raise prio and block both #174916 and #174313 since this makes the current settings mostly unhelpful. We need to ensure we are seeing alerts here.

Actions #7

Updated by jbaier_cz about 2 months ago

  • Assignee set to jbaier_cz
Actions #8

Updated by jbaier_cz about 2 months ago

  • Status changed from Workable to In Progress
Actions #9

Updated by jbaier_cz about 2 months ago

  • Status changed from In Progress to Resolved

I can confirm e-mail alerts are working as already documented in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Monitoring

The current configuration is to receive an e-mail for a problem if:

  1. The problem has been present for at least 15 minutes
  2. The problem severity is at least Average (i.e. Average, High or Critical); we are not informed about Information and Warning via e-mail

If we want to reevaluate those rules, an SD ticket is needed to adjust the settings; see #132752 for additional info.

Note: for the /var/tmp problem in comment 5 see #174313#note-16
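
Expressed as a small executable sketch of the two rules above (the severity names and the 15-minute delay are taken from this comment; nothing here is read from the actual Zabbix configuration):

# Notification rules as documented above, expressed as a tiny predicate.
SEVERITY_RANK = {"Not classified": 0, "Information": 1, "Warning": 2,
                 "Average": 3, "High": 4, "Critical": 5}

def email_expected(severity: str, problem_age_minutes: float) -> bool:
    """E-mail only if the problem is at least Average and older than 15 minutes."""
    return (SEVERITY_RANK[severity] >= SEVERITY_RANK["Average"]
            and problem_age_minutes >= 15)

# The / and /var/tmp problems from the description are only "Warning",
# so no e-mail is expected no matter how long they stay open:
print(email_expected("Warning", 24 * 60))   # False
print(email_expected("Average", 20))        # True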

Actions #10

Updated by tinita about 2 months ago · Edited

  • Status changed from Resolved to Feedback

The problem with the load average is a bit more complicated though.
We got some emails that the load is over 4, but looking at the graphs the load is over 4 very often and sometimes went over 40.
As the thresholds for warning and critical are currently 4 and 5, we would expect emails all the time. That's why we thought there might be a sporadic email problem.

Actions #11

Updated by jbaier_cz about 2 months ago

  • Related to action #174916: [alert][zabbix@suse.de] Problem: Load average is too high (per CPU load over 4 for 5m) size: S added
Actions #12

Updated by jbaier_cz about 2 months ago

  • Status changed from Feedback to Resolved

I see, that's probably something we should handle in #174916. The question at hand can be answered quite easily: the trigger causing the e-mail to be sent is "Load average is too high", which has severity Average. It uses the macro LOAD_AVG_PER_CPU.MAX.WARN (current value 4); there is also LOAD_AVG_PER_CPU.MAX.CRIT (current value 5), which is currently not used in any trigger.

So there is really only one threshold which will trigger the problem and eventually send the e-mail. Please note that the load needs to be over 40 for 5 minutes to trigger a problem (and then stay for an additional 15 minutes to trigger the e-mail notification). The threshold is 4 and the machine has 10 cores, hence 40 is the magic number for the load. The e-mail subject also states it is "per CPU load" over 4 (i.e. over 40 in our case).
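
The arithmetic in a runnable form (LOAD_AVG_PER_CPU.MAX.WARN = 4 and the 10 cores are taken from this comment; the exact trigger expression on the server is not reproduced here):

import os

LOAD_AVG_PER_CPU_MAX_WARN = 4          # macro value quoted above
cpu_count = os.cpu_count() or 1        # 10 on ariel according to this comment

# Absolute load that has to be sustained for 5 minutes before the problem fires,
# plus another 15 minutes before the e-mail goes out.
absolute_threshold = LOAD_AVG_PER_CPU_MAX_WARN * cpu_count
load1, load5, load15 = os.getloadavg()
print(f"current per-CPU load: {load1 / cpu_count:.2f}")
print(f"problem fires only above a per-CPU load of {LOAD_AVG_PER_CPU_MAX_WARN}, "
      f"i.e. an absolute load of {absolute_threshold} on this machine")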

Actions #13

Updated by jbaier_cz about 2 months ago

  • Related to action #175210: [o3][zabbix] reconsider e-mail notification settings size:S added