action #68410

closed

repeated alerts about "no data" that are not actionable and recover themselves often enough

Added by okurz almost 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-06-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

I repeatedly receive email alerts such as "[No Data] Queue: State (SUSE) alert" and the similar "[No Data] HTTP Response alert". I doubt anyone is acting on these, and the systems recover automatically anyway.

Problem

To prevent alarm fatigue we should only have alerts that are actionable and serious. These "no data" alerts should be prevented.

Actions #1

Updated by okurz almost 4 years ago

  • Status changed from In Progress to Feedback
Actions #2

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

Merged; maybe the alarms go away now. I hope we have not lost an alert about "the host has completely vanished" with this ;)

Actions #3

Updated by livdywan about 3 years ago

  • Status changed from Resolved to Feedback
[Alerting] Queue: State (SUSE) alert

Error message

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29": dial tcp [::1]:8086: connect: connection refused
Metric name

Value

The alert went back to OK within the next minute.

I'm tentatively re-opening the ticket - feel free to create a new one if you think it's a new issue.
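
The URL-encoded query in the error message above is hard to read. As a minimal sketch (not part of the original alert, and using only the Python standard library), it can be decoded as follows. The helper name `decode_query` is made up for illustration; the URL is copied verbatim from the error message. Port 8086 with a `/query` endpoint and `db=telegraf` suggests an InfluxDB-style backend, but that is an assumption, as the ticket does not name the database.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the alert's error message.
URL = (
    "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29"
    "+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27"
    "+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29"
)


def decode_query(url: str) -> dict:
    """Return the query-string parameters of `url` with percent/plus encoding undone."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key occurs once here, so take the first value.
    return {key: values[0] for key, values in params.items()}


params = decode_query(URL)
for key, value in params.items():
    print(f"{key}: {value}")
```

Decoded this way, the request is an ordinary mean over "scheduled" jobs for the last minute, so the failure lies in the refused connection to localhost:8086 rather than in the query or in missing data.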

Actions #4

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved
  • Target version set to Ready

The original issue was about "No data" alerts whenever hosts are offline. We fixed that by ignoring "no data" and adding a specific "host up" alert that is driven by pinging the corresponding machine from another instance, e.g. openqa.suse.de pings each worker. What you reference looks like something quite different: it is an actual "Alert", raised because the connection was refused. Please create a specific new issue about that.
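
To illustrate the "host up" idea described above, here is a rough sketch of a ping-based check. It is not the actual monitoring configuration, which is not shown in this ticket; the function names (`ping_once`, `host_up_states`, `hosts_to_alert`) and the example host names are hypothetical, and the `ping` flags assume the Linux iputils variant.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo request to `host` is answered.

    Assumes Linux iputils `ping`: -c is the packet count, -W the reply timeout in seconds.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A missing ping binary or a hung process counts as "not reachable".
        return False
    return result.returncode == 0


def host_up_states(
    hosts: Iterable[str], probe: Callable[[str], bool] = ping_once
) -> Dict[str, bool]:
    """Probe every host from the monitoring instance and record whether it answered."""
    return {host: probe(host) for host in hosts}


def hosts_to_alert(states: Dict[str, bool]) -> List[str]:
    """Alert only on hosts that are explicitly down, never on missing metrics."""
    return sorted(host for host, up in states.items() if not up)


# Demo with a stubbed probe so that no real network traffic is needed.
fake_network = {"worker1": True, "worker2": False}
demo_states = host_up_states(fake_network, probe=lambda host: fake_network[host])
print(hosts_to_alert(demo_states))
```

The point of the split is that a host going offline produces an explicit "down" result from the pinging instance, so the absence of metrics from that host no longer needs to trigger its own alert.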
