action #93650

alert: PROBLEM Service Alert: openqa.suse.de/fs_/assets is WARNING

Added by okurz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-06-08
Due date: 2021-07-07
% Done: 0%
Estimated time:

Description

Observation

From nagios:

Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Mon Jun 7 16:51:54 UTC 2021
Info: WARN - 80.2% used (5.62 of 7.00 TB), trend: +81.34 GB / 24 hours

Service: fs_/assets

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets

I have seen in the past days multiple "WARNING" and "OK" messages alternating.
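To put the trend figure from the nagios info line into perspective, here is a minimal sketch that parses the line and linearly extrapolates how many days remain until a given usage percentage is reached. The function names `parse_info` and `days_until` are hypothetical, the conversion of 1 TB = 1024 GB is an assumption about how the check reports units, and a constant growth rate is a simplification (asset cleanup would normally counteract it).

```python
import re

# Info line copied from the nagios notification above.
INFO = "WARN - 80.2% used (5.62 of 7.00 TB), trend: +81.34 GB / 24 hours"


def parse_info(line):
    """Extract (used_tb, total_tb, trend_gb_per_day) from the info line."""
    match = re.search(
        r"\(([\d.]+) of ([\d.]+) TB\), trend: ([+-]?[\d.]+) GB / 24 hours", line
    )
    if match is None:
        raise ValueError(f"unexpected info line format: {line!r}")
    used_tb, total_tb, trend_gb = (float(value) for value in match.groups())
    return used_tb, total_tb, trend_gb


def days_until(percent, used_tb, total_tb, trend_gb_per_day, gb_per_tb=1024):
    """Days until `percent` usage is reached, assuming a constant trend.

    Returns None if usage is not growing. gb_per_tb=1024 is an assumption.
    """
    if trend_gb_per_day <= 0:
        return None
    target_tb = total_tb * percent / 100
    remaining_gb = max(target_tb - used_tb, 0) * gb_per_tb
    return remaining_gb / trend_gb_per_day


if __name__ == "__main__":
    used, total, trend = parse_info(INFO)
    for threshold in (90, 92, 94):
        days = days_until(threshold, used, total, trend)
        print(f"{threshold}% reached in about {days:.1f} days at the current trend")
```

Under these assumptions the filesystem would still be more than a week away from 90%, which is consistent with treating a warning at roughly 80% as premature.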

Acceptance criteria

  • AC1: nagios only sends alert messages if grafana is also alerting and the condition is more severe than configured in grafana

Suggestions

Impact

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue


Related issues

Copied to openQA Infrastructure - action #94576: alert: PROBLEM Service Alert: openqa.suse.de/fs_/results is WARNING (Resolved)

History

#1 Updated by mkittler about 2 months ago

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue

Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the Re: you've just sent to the list).

Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring/ , ask in EngInfra ticket otherwise

When accessing the page I get 404. I assume this is actually 403. So I'll ask infra for access. (In the meantime someone else can pick up the ticket of course.)

#2 Updated by okurz about 2 months ago

mkittler wrote:

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue

Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the Re: you've just sent to the list).

Ah, good point. The alert came from nagios and is visible in the email and also at the referenced URL https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets
I thought you had worked with nagios alerts in the past?
Just today I added the following to https://progress.opensuse.org/projects/qa/wiki/Wiki#Onboarding-for-new-joiners: "Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring (create EngInfra ticket otherwise) and add yourself in https://gitlab.suse.de/OPS-Service/monitoring/-/tree/master/icinga/shared/contacts to receive monitoring information".

#3 Updated by okurz about 2 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=74&orgId=1&from=1600646082068&to=1623184779476 shows that we have not exceeded 90% for long, but the grafana alert is configured for 94%. I think we should go for 90% again, same as for other filesystems. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L60 we have configured the cleanup to keep 20% free, i.e. 80% usage. The nagios thresholds should then be above the grafana alerting limit, e.g. 92% warning, 94% critical.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/502
and
https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/16
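
The layering proposed in this comment (cleanup target below the grafana alert, which is below the nagios warning, which is below the nagios critical) can be sketched as a small consistency check. This is only an illustration of the intended ordering and of AC1; the function names `check_layering` and `nagios_should_notify` are hypothetical and do not correspond to anything in the salt states or the monitoring repository.

```python
def check_layering(cleanup_target, grafana_alert, nagios_warning, nagios_critical):
    """Return a list of ordering problems; an empty list means the layering is fine.

    All arguments are usage percentages. Each level is expected to be
    strictly below the next one.
    """
    levels = [
        ("cleanup target", cleanup_target),
        ("grafana alert", grafana_alert),
        ("nagios warning", nagios_warning),
        ("nagios critical", nagios_critical),
    ]
    problems = []
    for (low_name, low), (high_name, high) in zip(levels, levels[1:]):
        if not low < high:
            problems.append(f"{low_name} ({low}%) should be below {high_name} ({high}%)")
    return problems


def nagios_should_notify(usage, grafana_alert, nagios_warning):
    """AC1 as a predicate: nagios notifies only when grafana is alerting too."""
    return usage >= grafana_alert and usage >= nagios_warning


if __name__ == "__main__":
    # Values proposed in the comment: keep 20% free, grafana 90%, nagios 92%/94%.
    print(check_layering(80, 90, 92, 94))
    # The 80.2% usage from the original notification would no longer notify.
    print(nagios_should_notify(80.2, grafana_alert=90, nagios_warning=92))
```

With correctly ordered thresholds, any usage that reaches the nagios warning level has necessarily passed the grafana limit already, so AC1 follows from the ordering alone.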

#4 Updated by okurz about 1 month ago

  • Due date set to 2021-07-07
  • Status changed from Feedback to Blocked

#5 Updated by okurz about 1 month ago

Saw more alerts. My MR was still ignored. Created ticket as reminder: https://infra.nue.suse.com/SelfService/Display.html?id=189974

#6 Updated by okurz about 1 month ago

MR was merged but the change is not effective. Maybe I need to explicitly mention "assets" in a separate line: https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/18

#7 Updated by okurz about 1 month ago

  • Status changed from Blocked to Resolved

This worked. The alert thresholds are now fine in both nagios and grafana.

#8 Updated by okurz about 1 month ago

  • Copied to action #94576: alert: PROBLEM Service Alert: openqa.suse.de/fs_/results is WARNING added
