action #132815
closed[alert][flaky][o3] Multiple flaky zabbix alerts related to o3
0%
Description
Observation
Multiple emails received, see https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.07/maillist.html
Rollback steps
- Re-enable the "Load average is too high" trigger
Updated by jbaier_cz over 1 year ago
- Status changed from New to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz over 1 year ago
- Status changed from In Progress to Feedback
After a brief investigation I found out that the notification settings inside zabbix were overly verbose. I reconfigured them to alert only on issues of severity "Average" and higher (at this moment, this should include severe filesystem and accessibility problems).
Updated by jbaier_cz over 1 year ago
Threshold tweaks are currently blocked by missing permissions (SD-127003). I will wait overnight to see whether the notification settings change was actually enough or whether we need to wait for the SD ticket.
Updated by tinita over 1 year ago
Currently /assets is over 80% and we don't get an email, but we probably should. The threshold should just be a bit higher: asset cleanup happens when usage is over 80%, so we expect it to stay a bit over 80% for a short duration.
Updated by okurz over 1 year ago
- Related to action #132278: Basic o3 http response alert on zabbix size:M added
Updated by okurz over 1 year ago
- Related to coordination #132275: [epic] Better o3 monitoring added
Updated by okurz over 1 year ago
- Related to action #131150: Add alarms for partition usage on o3 size:M added
Updated by jbaier_cz over 1 year ago
jbaier_cz wrote:
The email alert threshold is 90%.
However, I can't verify that, because the VFS.FS.PUSED.MAX.CRIT macro with that value is hidden in the not yet visible template behind SD-127003.
Updated by okurz over 1 year ago
From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in grafana; I suggest applying it here as well, something in the range of 20-40. 1.5 is way too restrictive for our setup.
Updated by jbaier_cz over 1 year ago
- Description updated (diff)
- Status changed from Feedback to Blocked
okurz wrote:
From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)".
Hm, I see. So we are also hitting the "Average" severity trigger for CPU load... Blocking this on SD-127003 (to be able to change the threshold). In the meantime, I will deactivate the CPU load trigger to suppress the emails.
Updated by tinita over 1 year ago
okurz wrote:
From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in grafana; I suggest applying it here as well, something in the range of 20-40. 1.5 is way too restrictive for our setup.
20-40? Note that the current threshold 1.5 is the average per CPU, so 15 for 10 CPUs. Maybe set it to 4? (As soon as we get access).
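To make the unit mismatch explicit, here is a minimal sketch of the arithmetic, assuming the 10 CPUs mentioned above. The helper names `absolute_load` and `per_cpu_load` are made up for illustration and are not part of zabbix.

```python
def absolute_load(per_cpu_threshold: float, cpu_count: int) -> float:
    """Total system load at which a per-CPU threshold would fire."""
    return per_cpu_threshold * cpu_count


def per_cpu_load(absolute_threshold: float, cpu_count: int) -> float:
    """Per-CPU value equivalent to an absolute load threshold."""
    return absolute_threshold / cpu_count


CPU_COUNT = 10  # CPU count used in the comment above

# Current trigger: 1.5 per CPU
print(absolute_load(1.5, CPU_COUNT))  # 15.0
# Suggested trigger: 4 per CPU
print(absolute_load(4, CPU_COUNT))  # 40
# The proposed absolute range of 20-40, expressed per CPU
print(per_cpu_load(20, CPU_COUNT), per_cpu_load(40, CPU_COUNT))  # 2.0 4.0
```

So an absolute range of 20-40 corresponds to a per-CPU value of 2-4 on this machine, and a per-CPU value of 4 lands at the upper end of that range.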
Updated by jbaier_cz over 1 year ago
- Status changed from Blocked to In Progress
SD ticket resolved, we are no longer blocked.
Updated by jbaier_cz over 1 year ago
- Status changed from In Progress to Resolved
Thresholds for alerts were adjusted: CPU load should trigger at 4 times the number of CPU cores (that should be a load over 40 in our case); storage triggers should warn us at around 90% of space used, and the same holds for available memory; finally, we have some availability triggers. I have not seen any alerts since this morning, so I am considering this resolved for now. Any additional improvements should be made (and will be made) in other tickets.
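For a quick local sanity check against these numbers, the load and storage rules can be sketched in Python as below. This is not how zabbix evaluates its triggers: the function names, the strict "greater than" comparison and the choice of the 5-minute load average are assumptions made for illustration only.

```python
import os
import shutil

LOAD_FACTOR_PER_CORE = 4    # from the comment: 4 times the CPU cores
DISK_USED_PCT_LIMIT = 90.0  # from the comment: around 90% of space used


def load_exceeded(load_avg: float, cpu_count: int,
                  factor: float = LOAD_FACTOR_PER_CORE) -> bool:
    """True if the load average is above factor * cpu_count (assumed strict '>')."""
    return load_avg > factor * cpu_count


def disk_exceeded(used_bytes: int, total_bytes: int,
                  limit_pct: float = DISK_USED_PCT_LIMIT) -> bool:
    """True if the used share of the filesystem is above limit_pct percent."""
    return 100.0 * used_bytes / total_bytes > limit_pct


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load5 = os.getloadavg()[1]        # 5-minute load average
    cores = os.cpu_count() or 1
    usage = shutil.disk_usage("/")    # the path is an example, adjust as needed
    print("load alert:", load_exceeded(load5, cores))
    print("disk alert:", disk_exceeded(usage.used, usage.total))
```

With 10 cores, `load_exceeded(41, 10)` is true while `load_exceeded(40, 10)` is not, which matches the "load over 40" wording above.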