action #93650

closed

alert: PROBLEM Service Alert: openqa.suse.de/fs_/assets is WARNING

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2021-06-08
Due date:
2021-07-07
% Done:

0%

Estimated time:

Description

Observation

From nagios:

Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Mon Jun 7 16:51:54 UTC 2021
Info: WARN - 80.2% used (5.62 of 7.00 TB), trend: +81.34 GB / 24 hours

Service: fs_/assets

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets

Over the past few days I have seen multiple "WARNING" and "OK" messages alternating.
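As a rough cross-check of the figures in the notification above, the following sketch estimates how long the reported trend would take to reach a given usage level. The function name `days_until` is hypothetical, the trend is assumed to stay linear, and the conversion of 1 TB = 1000 GB is an assumption (the monitoring check may use binary units instead).

```python
def days_until(threshold_percent, used_tb=5.62, total_tb=7.00,
               trend_gb_per_day=81.34, gb_per_tb=1000.0):
    """Estimate the days until usage reaches threshold_percent.

    Defaults are the values from the nagios notification; gb_per_tb is an
    assumption about the units used by the check.
    """
    # Space still available below the threshold, converted to GB.
    remaining_gb = (threshold_percent / 100.0 * total_tb - used_tb) * gb_per_tb
    if remaining_gb <= 0:
        return 0.0  # threshold already reached or exceeded
    if trend_gb_per_day <= 0:
        return float("inf")  # usage is not growing
    return remaining_gb / trend_gb_per_day


if __name__ == "__main__":
    for level in (90, 94):
        print(f"{level}%: {days_until(level):.1f} days at the current trend")
```

Under these assumptions the filesystem would reach 90% in roughly 8 days and 94% in roughly 12 days, provided nothing is cleaned up in the meantime.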

Acceptance criteria

  • AC1: nagios only sends alert messages if grafana is also alerting and the condition is more severe than configured in grafana

Suggestions

Impact

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue


Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #94576: alert: PROBLEM Service Alert: openqa.suse.de/fs_/results is WARNING (Resolved, okurz)

Actions #1

Updated by mkittler almost 3 years ago

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue

Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the Re: you've just sent to the list).

Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring/ , ask in EngInfra ticket otherwise

When accessing the page I get 404. I assume this is actually 403. So I'll ask infra for access. (In the meantime someone else can pick up the ticket of course.)

Actions #2

Updated by okurz almost 3 years ago

mkittler wrote:

Prioritized as "Urgent" as the alert was ignored and not handled by multiple persons for days and we are apparently suffering from alarm fatigue

Where was the alert visible? I am subscribed to osd-admins@suse.de but didn't receive an email (apart from the Re: you've just sent to the list).

Ah, good point. The alert came from nagios and is visible in the email and also at the referenced URL https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fassets
I thought you had worked with nagios alerts in the past?
Just today I added the following to https://progress.opensuse.org/projects/qa/wiki/Wiki#Onboarding-for-new-joiners : "Ensure you have access to https://gitlab.suse.de/OPS-Service/monitoring (create EngInfra ticket otherwise) and add yourself in https://gitlab.suse.de/OPS-Service/monitoring/-/tree/master/icinga/shared/contacts to receive monitoring information".

Actions #3

Updated by okurz almost 3 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=74&orgId=1&from=1600646082068&to=1623184779476 shows that we have not exceeded 90% for long, but the alert is configured for 94%. I think we should go back to 90%, the same as for other filesystems. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L60 we have configured to keep 20% free, i.e. 80% usage. The nagios thresholds should then sit above the grafana alerting limit, e.g. 92% warning, 94% critical.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/502
and
https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/16
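
The proposed layering of thresholds can be sketched as follows. This is only an illustration of the idea, not the actual grafana or nagios configuration: the class and function names are hypothetical, the percentages are the example values from the comment above, and treating each threshold as "greater than or equal" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Usage levels in percent; example values from the proposal above."""
    usage_target: float = 80.0     # 20% kept free per salt-states-openqa
    grafana_alert: float = 90.0    # proposed grafana alert level
    nagios_warning: float = 92.0   # example nagios warning level
    nagios_critical: float = 94.0  # example nagios critical level

    def is_ordered(self) -> bool:
        # Each layer must trigger strictly later than the previous one.
        return (self.usage_target < self.grafana_alert
                < self.nagios_warning < self.nagios_critical)


def classify(usage_percent: float, t: Thresholds = Thresholds()) -> dict:
    """Return which systems would alert at the given usage (assumes >=)."""
    if usage_percent >= t.nagios_critical:
        nagios_state = "CRITICAL"
    elif usage_percent >= t.nagios_warning:
        nagios_state = "WARNING"
    else:
        nagios_state = "OK"
    return {
        "grafana_alerting": usage_percent >= t.grafana_alert,
        "nagios_state": nagios_state,
    }


if __name__ == "__main__":
    for usage in (80.2, 91.0, 93.0, 95.0):
        print(usage, classify(usage))
```

With thresholds ordered this way, nagios can only leave the "OK" state once grafana is already alerting, which is what AC1 asks for. The 80.2% usage from the original notification would trigger neither system.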

Actions #4

Updated by okurz almost 3 years ago

  • Due date set to 2021-07-07
  • Status changed from Feedback to Blocked

Actions #5

Updated by okurz almost 3 years ago

Saw more alerts. My MR was still ignored. Created ticket as reminder: https://infra.nue.suse.com/SelfService/Display.html?id=189974

Actions #6

Updated by okurz almost 3 years ago

The MR was merged but the change has not taken effect. Maybe I need to explicitly mention "assets" on a separate line: https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/18

Actions #7

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

This worked. The alert thresholds are now fine in both nagios and grafana.

Actions #8

Updated by okurz almost 3 years ago

  • Copied to action #94576: alert: PROBLEM Service Alert: openqa.suse.de/fs_/results is WARNING added