action #93650
closed
alert: PROBLEM Service Alert: openqa.suse.de/fs_/assets is WARNING
Description
Observation
From nagios:
Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Mon Jun 7 16:51:54 UTC 2021
Info: WARN - 80.2% used (5.62 of 7.00 TB), trend: +81.34 GB / 24 hours
Service: fs_/assets
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets
Over the past few days I have seen multiple "WARNING" and "OK" messages alternating.
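The trend figure in the nagios message can be turned into a rough estimate of how long it takes to reach a given usage level. The following is a minimal sketch, not part of the monitoring setup: the function name `days_until`, the 90% example threshold and the decimal conversion of 1000 GB per TB are assumptions made for illustration (the check may report binary units).

```python
def days_until(threshold_percent, used_tb, total_tb, trend_gb_per_day, gb_per_tb=1000.0):
    """Estimate the days until usage reaches threshold_percent, assuming a constant trend.

    gb_per_tb is an assumption; pass 1024.0 if the reported units are binary.
    """
    target_tb = total_tb * threshold_percent / 100.0
    remaining_gb = (target_tb - used_tb) * gb_per_tb
    if remaining_gb <= 0:
        return 0.0  # threshold already reached
    if trend_gb_per_day <= 0:
        return float("inf")  # usage is not growing
    return remaining_gb / trend_gb_per_day


# Values taken from the nagios notification above; 90% is an illustrative threshold.
print(round(days_until(90, used_tb=5.62, total_tb=7.00, trend_gb_per_day=81.34), 1))
```

With the reported values and the decimal assumption this gives a little over eight days, provided the growth stays constant, which asset cleanup would normally prevent.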
Acceptance criteria
- AC1: nagios only sends alert messages if grafana is also alerting and the condition is more severe than configured in grafana
Suggestions
- Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring/ , ask in EngInfra ticket otherwise
- Adapt the levels in https://gitlab.suse.de/OPS-Service/monitoring/-/blob/master/check_mk/nue-cmk/main.mk#L51 and in our grafana so that our grafana alert level is below the nagios level
Impact
Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue
Updated by mkittler over 3 years ago
Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue
Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the "Re:" you've just sent to the list).
Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring/ , ask in EngInfra ticket otherwise
When accessing the page I get a 404, which I assume is actually a 403, so I'll ask infra for access. (In the meantime someone else can pick up the ticket, of course.)
Updated by okurz over 3 years ago
mkittler wrote:
Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue
Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the "Re:" you've just sent to the list).
Ah, good point. The alert came from nagios and is visible in the email and also at the referenced URL https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets
I thought you had worked with nagios alerts in the past?
Just today I added in https://progress.opensuse.org/projects/qa/wiki/Wiki#Onboarding-for-new-joiners "Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring (create EngInfra ticket otherwise) and add yourself in https://gitlab.suse.de/OPS-Service/monitoring/-/tree/master/icinga/shared/contacts to receive monitoring information".
Updated by okurz over 3 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=74&orgId=1&from=1600646082068&to=1623184779476 shows that we have not exceeded 90% for long, but the grafana alert is configured for 94%. I think we should go for 90% again, the same as for other filesystems. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L60 the configuration keeps 20% free, i.e. 80% usage. The nagios levels should then be above the grafana alerting limit, e.g. 92% warning and 94% critical.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/502 and https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/16
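The intended ordering of the levels (80% usage kept by configuration, grafana at 90%, nagios at 92% warning and 94% critical) can be sketched as follows. This is only an illustration of AC1: the names `Thresholds` and `firing_alerts` are hypothetical, and the inclusive `>=` comparison is an assumption, not how grafana or check_mk actually evaluate their limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    """Usage levels in percent, using the values proposed in this comment."""
    kept_usage: float = 80.0       # 20% kept free by configuration
    grafana_alert: float = 90.0
    nagios_warning: float = 92.0
    nagios_critical: float = 94.0

    def is_consistent(self) -> bool:
        # AC1: nagios must only trigger after grafana is already alerting.
        return (self.kept_usage < self.grafana_alert
                < self.nagios_warning < self.nagios_critical)


def firing_alerts(usage_percent: float, t: Thresholds) -> List[str]:
    """Return which alerts would fire at the given usage (assumed '>=' semantics)."""
    alerts = []
    if usage_percent >= t.grafana_alert:
        alerts.append("grafana")
    if usage_percent >= t.nagios_critical:
        alerts.append("nagios CRITICAL")
    elif usage_percent >= t.nagios_warning:
        alerts.append("nagios WARNING")
    return alerts


levels = Thresholds()
print(levels.is_consistent())
for usage in (80.2, 91.0, 93.0, 95.0):
    print(usage, firing_alerts(usage, levels))
```

With these levels the 80.2% usage from the original notification would trigger nothing, and nagios would never be the only system alerting.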
Updated by okurz over 3 years ago
- Due date set to 2021-07-07
- Status changed from Feedback to Blocked
Updated by okurz over 3 years ago
Saw more alerts. My MR was still ignored. Created ticket as reminder: https://infra.nue.suse.com/SelfService/Display.html?id=189974
Updated by okurz over 3 years ago
The MR was merged but the change has no effect. Maybe I need to explicitly mention "assets" in a separate line: https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/18
Updated by okurz over 3 years ago
- Status changed from Blocked to Resolved
This worked. The alert thresholds are now fine in both nagios and grafana.
Updated by okurz over 3 years ago
- Copied to action #94576: alert: PROBLEM Service Alert: openqa.suse.de/fs_/results is WARNING