action #132815: [alert][flaky][o3] Multiple flaky zabbix alerts related to o3 - openQA Infrastructure - openSUSE Project Management Tool

action #132815

closed

[alert][flaky][o3] Multiple flaky zabbix alerts related to o3

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

Urgent

Assignee:

jbaier_cz

Category:

Target version:

openQA Project - Ready

Start date:

2023-07-16

Due date:

% Done:

Estimated time:

Tags:

alert, o3, infra

Description

Observation¶

Multiple emails received, see https://mailman.suse.de/mlarch/SuSE/o3-admins/2023/o3-admins.2023.07/maillist.html

Rollback steps¶

Re-enable the "Load average is too high" trigger

Related issues 3 (1 open — 2 closed)

Updated by jbaier_cz over 1 year ago

Status changed from New to In Progress
Assignee set to jbaier_cz

Updated by jbaier_cz over 1 year ago

Status changed from In Progress to Feedback

After a brief investigation I found that the notification settings in Zabbix were overly verbose. I reconfigured them to alert only on issues of severity "Average" and higher (at the moment, this should include severe filesystem and accessibility problems).

Updated by jbaier_cz over 1 year ago

Threshold tweaks are currently blocked by missing permissions, see SD-127003. I will wait overnight to see whether the notification settings change was enough or whether we need to wait for the SD ticket.

Updated by tinita over 1 year ago

Currently /assets is over 80% and we don't get an email. We probably should get one, but the threshold should be somewhat higher than 80%: asset cleanup only starts once usage exceeds 80%, so we expect usage to sit slightly above 80% for short periods.

Updated by okurz over 1 year ago

Related to action #132278: Basic o3 http response alert on zabbix size:M added

Updated by okurz over 1 year ago

Related to coordination #132275: [epic] Better o3 monitoring added

Updated by okurz over 1 year ago

Related to action #131150: Add alarms for partition usage on o3 size:M added

Updated by jbaier_cz over 1 year ago

The email alert threshold is 90%.

Updated by jbaier_cz over 1 year ago

jbaier_cz wrote:

The email alert threshold is 90%.

However, I can't verify that, because the VFS.FS.PUSED.MAX.CRIT macro holding that value lives in a template that is not yet visible to us, pending SD-127003.

#10

Updated by okurz over 1 year ago

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in Grafana; I suggest applying it here as well, something in the range of 20-40. 1.5 is far too restrictive for our setup.

#11

Updated by jbaier_cz over 1 year ago

Description updated (diff)
Status changed from Feedback to Blocked

okurz wrote:

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)".

Hm, I see. So we are also hitting the "Average" severity trigger for CPU load. Blocking this on SD-127003 (needed to be able to change the threshold). In the meantime, I will deactivate the CPU load trigger to suppress the emails.

#12

Updated by tinita over 1 year ago

okurz wrote:

From today: "Problem: Load average is too high (per CPU load over 1.5 for 5m)". Please look up the corresponding threshold that we defined for OSD in Grafana; I suggest applying it here as well, something in the range of 20-40. 1.5 is far too restrictive for our setup.

20-40? Note that the current threshold of 1.5 is the load average per CPU, so it corresponds to a total load of 15 on 10 CPUs. Maybe set it to 4 per CPU (as soon as we get access)?
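
To make the per-CPU versus total load arithmetic explicit, here is a minimal Python sketch. The function names are hypothetical and not part of Zabbix; the CPU count of 10 is taken from the comment above.

```python
def total_load_threshold(per_cpu_threshold: float, cpu_count: int) -> float:
    """Convert a per-CPU load threshold into the equivalent total load average."""
    return per_cpu_threshold * cpu_count


def per_cpu_threshold(total_load: float, cpu_count: int) -> float:
    """Convert a total load average into the equivalent per-CPU threshold."""
    return total_load / cpu_count


CPU_COUNT = 10  # assumption: number of CPUs on o3, as stated in the comment above

# Current trigger (1.5 per CPU) and the proposed one (4 per CPU) as total load.
current_total = total_load_threshold(1.5, CPU_COUNT)
proposed_total = total_load_threshold(4.0, CPU_COUNT)

# The suggested total range of 20-40 expressed per CPU.
suggested_per_cpu = (per_cpu_threshold(20, CPU_COUNT), per_cpu_threshold(40, CPU_COUNT))
```

With 10 CPUs, the suggested total range of 20-40 corresponds to 2-4 per CPU, so a per-CPU threshold of 4 sits at the upper end of that range.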

#13

Updated by jbaier_cz over 1 year ago

Status changed from Blocked to In Progress

SD ticket resolved, we are no longer blocked.

#14

Updated by jbaier_cz over 1 year ago

Status changed from In Progress to Resolved

Thresholds for alerts were adjusted: CPU load should trigger at 4 times the number of CPU cores (that should be a load over 40 in our case), storage triggers should warn us at around 90% of space used, the same holds for available memory, and finally we have some availability triggers. I have not seen any alert since this morning, so I am considering this resolved for now. Any additional improvements should be made (and will be made) in other tickets.
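
As a rough illustration of the adjusted thresholds, the following Python sketch evaluates sample readings against them. It is not how Zabbix evaluates triggers; the names, the strict "greater than" comparison, and the reading of the memory threshold as 90% used are assumptions made for this example.

```python
from typing import List

# Assumed values, taken from the summary above.
LOAD_PER_CORE_LIMIT = 4.0   # CPU load triggers at 4 times the number of cores
STORAGE_USED_LIMIT = 90.0   # percent of space used
MEMORY_USED_LIMIT = 90.0    # percent used (assumed interpretation)


def firing_triggers(load_avg: float, cpu_cores: int,
                    storage_used_pct: float, memory_used_pct: float) -> List[str]:
    """Return the names of the hypothetical triggers that would fire."""
    firing = []
    if load_avg > LOAD_PER_CORE_LIMIT * cpu_cores:
        firing.append("cpu_load")
    if storage_used_pct > STORAGE_USED_LIMIT:
        firing.append("storage")
    if memory_used_pct > MEMORY_USED_LIMIT:
        firing.append("memory")
    return firing


# A load of 15 on 10 cores tripped the old 1.5 per-CPU limit at most marginally
# and stays well below the new limit; /assets at 82% stays below the 90% limit.
quiet = firing_triggers(15.0, 10, 82.0, 50.0)

# A load of 41 on 10 cores and 91% storage usage would both alert.
noisy = firing_triggers(41.0, 10, 91.0, 50.0)
```

Under these assumptions, the situations discussed earlier in this ticket (load around 15, /assets slightly above 80%) would no longer produce emails.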
