action #68410
closed
repeated alerts about "no data" that are not actionable and recover themselves often enough
0%
Description
Observation
I repeatedly receive email alerts such as "[No Data] Queue: State (SUSE) alert" and the similar "[No Data] HTTP Response alert". I doubt anyone is acting on these, and the systems recover automatically anyway.
Problem
We should only have actionable and serious alerts to prevent alarm fatigue. These non-actionable "no data" alerts should be prevented.
Updated by okurz over 4 years ago
- Status changed from In Progress to Feedback
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
Merged, so maybe the alarms go away now. I hope that with this we have not lost an alert about "the host has completely vanished" ;)
Updated by livdywan almost 4 years ago
- Status changed from Resolved to Feedback
[Alerting] Queue: State (SUSE) alert
Error message
tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29": dial tcp [::1]:8086: connect: connection refused
Metric name
Value
The alert went back to OK within the next minute.
I'm tentatively re-opening the ticket - feel free to create a new one if you think it's a new issue.
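For reference, the URL in the error message above can be decoded to see which query failed and against which endpoint. This is a minimal Python sketch using only the standard library; `describe_query` is a hypothetical helper name and not part of any tool mentioned in this ticket.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the error message in this comment.
FAILED_URL = (
    "http://localhost:8086/query?db=telegraf&epoch=s"
    "&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22"
    "+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27"
    "+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29"
)


def describe_query(url):
    """Split the URL and decode its query parameters (hypothetical helper)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # decodes '+' to spaces and %xx escapes
    return {
        "host": parts.hostname,
        "port": parts.port,
        "db": params["db"][0],
        "query": params["q"][0],
    }


info = describe_query(FAILED_URL)
print(f"{info['host']}:{info['port']} db={info['db']}")
print(info["query"])
```

The decoded query asks for the mean of "scheduled" from "openqa_jobs" over the last minute. The failure itself is "connection refused" on localhost:8086, so the request never reached the datasource.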
Updated by okurz almost 4 years ago
- Status changed from Feedback to Resolved
- Target version set to Ready
The original issue was about "No data" whenever hosts are offline. We fixed that by ignoring "no data" and having a specific "host up" alert that is controlled by pinging the corresponding machine from another instance, e.g. openqa.suse.de pings each worker. What you reference looks like something quite different as it's an actual "Alert" but communication is refused. Please create a specific new issue about that.
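To illustrate the split described above, here is a rough Python sketch of the idea, not the actual monitoring configuration: metric alerts treat "no data" as OK, and a separate "host up" check pings each machine from another instance. All names (`evaluate_metric_alert`, `ping_once`, `hosts_down`) and the example hostnames are hypothetical, and the `ping` flags assume Linux iputils.

```python
import subprocess


def evaluate_metric_alert(values, threshold):
    """Return the alert state for a metric series.

    An empty series ("no data") is deliberately treated as OK, because a
    vanished host is covered by the separate ping-based check below.
    """
    if not values:
        return "ok"
    mean = sum(values) / len(values)
    return "alerting" if mean > threshold else "ok"


def ping_once(host, timeout_s=2):
    """Send one ICMP echo request; True if it was answered.

    Assumes the Linux iputils flags: -c count, -W reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def hosts_down(hosts, ping=ping_once):
    """Return the hosts that did not answer; `ping` is injectable for testing."""
    return [host for host in hosts if not ping(host)]


if __name__ == "__main__":
    # Hypothetical worker names, for illustration only.
    workers = ["worker1.example", "worker2.example"]
    for host in hosts_down(workers):
        print(f"host up alert: {host} does not respond to ping")
```

With this split, a worker going offline triggers exactly one "host up" alert instead of a "no data" alert per metric panel.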