action #132815

closed

[alert][flaky][o3] Multiple flaky zabbix alerts related to o3

Added by okurz 11 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-07-16
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

Multiple emails received, see https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.07/maillist.html

Rollback steps

  • Re-enable the "Load average is too high" trigger

Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure - action #132278: Basic o3 http response alert on zabbix size:M (Resolved, jbaier_cz)
Related to openQA Infrastructure - coordination #132275: [epic] Better o3 monitoring (Blocked, okurz, 2023-06-07)
Related to openQA Infrastructure - action #131150: Add alarms for partition usage on o3 size:M (Resolved, livdywan, 2023-06-20)
Actions #1

Updated by jbaier_cz 11 months ago

  • Status changed from New to In Progress
  • Assignee set to jbaier_cz
Actions #2

Updated by jbaier_cz 11 months ago

  • Status changed from In Progress to Feedback

After a brief investigation I found out that the notification settings inside zabbix were overly verbose. I reconfigured them to alert only on issues with severity "Average" and higher (at this moment, this should include severe filesystem and accessibility problems).

Actions #3

Updated by jbaier_cz 11 months ago

Threshold tweaks are currently blocked by missing permissions (SD-127003). I will wait overnight to see whether the notification settings change was actually enough or whether we need to wait for the SD ticket.

Actions #4

Updated by tinita 11 months ago

Currently /assets is over 80% and we don't get an email. We probably should, but the threshold should be a bit higher: asset cleanup kicks in when usage exceeds 80%, so we expect it to sit slightly above 80% for short periods.

Actions #5

Updated by okurz 11 months ago

  • Related to action #132278: Basic o3 http response alert on zabbix size:M added
Actions #6

Updated by okurz 11 months ago

Actions #7

Updated by okurz 11 months ago

  • Related to action #131150: Add alarms for partition usage on o3 size:M added
Actions #8

Updated by jbaier_cz 11 months ago

The email alert threshold is 90%.

Actions #9

Updated by jbaier_cz 11 months ago

jbaier_cz wrote:

The email alert threshold is 90%.

However, I can't verify that, because the VFS.FS.PUSED.MAX.CRIT macro with that value is hidden in a template that is not yet visible to us (pending SD-127003).

Actions #10

Updated by okurz 11 months ago

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in grafana. I suggest applying that here as well, something in the range of 20-40. 1.5 is way too restrictive for our setup.

Actions #11

Updated by jbaier_cz 11 months ago

  • Description updated (diff)
  • Status changed from Feedback to Blocked

okurz wrote:

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)".

Hm, I see. So we are also hitting the "Average" severity trigger for CPU load... Blocking this on SD-127003 (to be able to change the threshold). In the meantime, I will deactivate the CPU load trigger to suppress the emails.

Actions #12

Updated by tinita 11 months ago

okurz wrote:

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in grafana. I suggest applying that here as well, something in the range of 20-40. 1.5 is way too restrictive for our setup.

20-40? Note that the current threshold 1.5 is the average per CPU, so 15 for 10 CPUs. Maybe set it to 4? (As soon as we get access).
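To make the arithmetic explicit, here is a minimal Python sketch of the conversion between a per-CPU threshold and the total load average it corresponds to. The function name is made up for illustration, and the CPU count of 10 is taken from the comment above rather than measured on the host.

```python
def total_load_threshold(per_cpu_threshold: float, cpu_count: int) -> float:
    """Convert a per-CPU load threshold into the equivalent total load average."""
    return per_cpu_threshold * cpu_count


# Assumption: 10 CPUs, as stated in the comment above.
CPU_COUNT = 10

for per_cpu in (1.5, 4.0):
    total = total_load_threshold(per_cpu, CPU_COUNT)
    print(f"per-CPU threshold {per_cpu} -> total load {total}")
```

With 10 CPUs, a per-CPU value of 4 lands at the upper end of the suggested 20-40 range.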

Actions #13

Updated by jbaier_cz 11 months ago

  • Status changed from Blocked to In Progress

SD ticket resolved, we are no longer blocked.

Actions #14

Updated by jbaier_cz 11 months ago

  • Status changed from In Progress to Resolved

Thresholds for alerts were adjusted: CPU load should trigger at 4 times the number of CPU cores (a load over 40 in our case), storage triggers should warn us at around 90% of space used, and the same holds for available memory. Finally, we have some availability triggers. I have not seen any alerts since this morning, so I am considering this resolved for now. Any additional improvements should be made (and will be made) in other tickets.
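For anyone wanting to sanity-check a host against these numbers by hand, the following is a rough Python sketch of the two main checks. It is not the zabbix trigger configuration: the function and constant names are hypothetical, the values 4 and 90 are copied from the comment above, and the path "/" is only a placeholder for whichever partition is of interest.

```python
import os
import shutil

# Values taken from the comment above; names are illustrative only.
LOAD_PER_CPU_MAX = 4.0  # trigger when load exceeds 4 times the CPU cores
FS_USED_PCT_MAX = 90.0  # warn at around 90% of space used


def load_too_high(load_avg: float, cpu_count: int,
                  per_cpu_max: float = LOAD_PER_CPU_MAX) -> bool:
    """Return True if the load average per CPU exceeds the threshold."""
    return load_avg / cpu_count > per_cpu_max


def fs_too_full(used_bytes: int, total_bytes: int,
                pct_max: float = FS_USED_PCT_MAX) -> bool:
    """Return True if the used share of the filesystem exceeds the threshold."""
    return 100.0 * used_bytes / total_bytes > pct_max


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load5 = os.getloadavg()[1]      # 5-minute load average
    cpus = os.cpu_count() or 1
    usage = shutil.disk_usage("/")  # placeholder path
    print("load too high:", load_too_high(load5, cpus))
    print("filesystem too full:", fs_too_full(usage.used, usage.total))
```

Keeping the comparisons in small pure functions makes the thresholds easy to test without touching a live host.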
