action #93650
alert: PROBLEM Service Alert: openqa.suse.de/fs_/assets is WARNING
Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2021-06-08
Due date: 2021-07-07
% Done: 0%
Estimated time:
Description
Observation
From nagios:
Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Mon Jun 7 16:51:54 UTC 2021
Info: WARN - 80.2% used (5.62 of 7.00 TB), trend: +81.34 GB / 24 hours
Service: fs_/assets
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets
Over the past days I have seen multiple "WARNING" and "OK" messages alternating for this service.
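To put the figures from the notification into perspective, here is a minimal sketch that recomputes the usage percentage and estimates how long the free space would last if the reported trend stayed constant. The unit conversion (1 TB = 1000 GB) is an assumption; the check may use binary units, which shifts the estimate only slightly. Either way the result is roughly 17 days.

```python
# Figures copied from the nagios notification above.
used_tb = 5.62
total_tb = 7.00
trend_gb_per_day = 81.34

# Assumption: 1 TB = 1000 GB. The monitoring check may use binary units instead.
GB_PER_TB = 1000

# Usage recomputed from the rounded TB values, so it can differ
# slightly from the percentage nagios reports.
used_percent = used_tb / total_tb * 100

# Remaining space and a naive linear projection of the current trend.
free_gb = (total_tb - used_tb) * GB_PER_TB
days_until_full = free_gb / trend_gb_per_day

print(f"used: {used_percent:.1f}%")
print(f"days until full at current trend: {days_until_full:.1f}")
```

The projection is linear and ignores asset cleanup, so it is only a rough indication of urgency.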
Acceptance criteria
- AC1: nagios only sends alert messages if grafana is also alerting and the condition is more severe than configured in grafana
Suggestions
- Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring/ , ask in EngInfra ticket otherwise
- Adapt the levels in https://gitlab.suse.de/OPS-Service/monitoring/-/blob/master/check_mk/nue-cmk/main.mk#L51 and in our grafana so that our grafana alert level is below the nagios level
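The relationship AC1 asks for can be sketched as a small consistency check. The function names and the threshold values below are hypothetical placeholders, not the values configured in main.mk or in grafana; the sketch only illustrates the intended ordering of the two levels.

```python
def levels_consistent(grafana_warn_percent: float, nagios_warn_percent: float) -> bool:
    """True if grafana alerts strictly before nagios does, as AC1 requires."""
    return grafana_warn_percent < nagios_warn_percent


def alert_states(used_percent: float,
                 grafana_warn_percent: float,
                 nagios_warn_percent: float) -> dict:
    """Report which system would be alerting at a given usage level."""
    return {
        "grafana": used_percent >= grafana_warn_percent,
        "nagios": used_percent >= nagios_warn_percent,
    }


# Hypothetical example levels, NOT taken from the real configuration.
GRAFANA_WARN = 80.0
NAGIOS_WARN = 85.0

assert levels_consistent(GRAFANA_WARN, NAGIOS_WARN)

# At the usage from the notification, only grafana would alert.
print(alert_states(80.2, GRAFANA_WARN, NAGIOS_WARN))
```

With the levels ordered this way, any usage that triggers nagios has already triggered grafana, so a nagios message always signals a more severe condition.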
Impact
Prioritized as "Urgent" because the alert was ignored and left unhandled by multiple people for days, which suggests we are suffering from alarm fatigue.