Repeated "no data" alerts that are not actionable and usually recover on their own
From time to time I receive email alerts such as "[No Data] Queue: State (SUSE) alert" and, similarly, "[No Data] HTTP Response alert". I doubt anyone is acting on these, and the systems recover automatically.
We should only have actionable and serious alerts, to prevent alarm fatigue. These "no data" alerts should not be sent.
#1 Updated by okurz about 1 year ago
- Status changed from In Progress to Feedback
- Status changed from Resolved to Feedback
[Alerting] Queue: State (SUSE) alert
Error message: tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29": dial tcp [::1]:8086: connect: connection refused
Metric name Value
The alert went back to OK within the next minute.
I'm tentatively re-opening the ticket - feel free to create a new one if you think it's a new issue.
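For readability, the URL-encoded query in the error message above can be decoded with the Python standard library. This is only a sketch for inspecting the message; the helper name `decode_query` is made up for illustration and is not part of any monitoring setup.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the alert's error message.
URL = (
    "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled"
    "%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2F"
    "openqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29"
    "+fill%28null%29"
)


def decode_query(url):
    """Return the query-string parameters of url as a plain dict (hypothetical helper)."""
    params = parse_qs(urlsplit(url).query)
    # Each parameter occurs once here, so keep only the first value.
    return {key: values[0] for key, values in params.items()}


params = decode_query(URL)
print(params["db"])  # database the query was sent to
print(params["q"])   # the decoded query text
```

The decoded `q` parameter reads: SELECT mean("scheduled") FROM "openqa_jobs" WHERE "url" = 'https://openqa.suse.de' AND time > now() - 1m GROUP BY time(40s) fill(null). The failure itself is the refused TCP connection to localhost:8086, not the query.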
- Status changed from Feedback to Resolved
- Target version set to Ready
The original issue was about "No data" alerts whenever hosts are offline. We fixed that by ignoring "no data" and adding a specific "host up" alert that is driven by pinging the corresponding machine from another instance, e.g. openqa.suse.de pings each worker. What you reference looks quite different: it is an actual "Alerting" state, caused by a refused connection. Please create a specific new issue about that.
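The logic described above can be sketched roughly as follows. This is not the actual alert configuration, only an illustration of the idea: missing data alone never raises an alert, while a separate ping-based check reports hosts that are down. The function names, the host names, the threshold and the `ping` flags (`-c`, `-W` as in Linux iputils) are all assumptions.

```python
import subprocess


def host_is_up(host, timeout_s=2):
    """Send one ICMP echo request; True if it is answered (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def evaluate(hosts, metrics, threshold, is_up=host_is_up):
    """Return {host: state}; metrics maps host to its latest value, or None for no data."""
    states = {}
    for host in hosts:
        value = metrics.get(host)
        if not is_up(host):
            states[host] = "host down"   # dedicated "host up" alert fires
        elif value is None:
            states[host] = "ok"          # "no data" is ignored, no alert
        elif value > threshold:
            states[host] = "alerting"    # a real metric-based alert
        else:
            states[host] = "ok"
    return states


# Usage with a fake reachability check, so the example needs no network access.
# The worker names are hypothetical.
fake_is_up = lambda host: host != "worker2.example"
states = evaluate(
    ["worker1.example", "worker2.example", "worker3.example"],
    {"worker1.example": None, "worker3.example": 12.0},
    threshold=10,
    is_up=fake_is_up,
)
print(states)
```

With this split, an offline worker produces one "host down" alert instead of a "no data" alert for every metric it normally reports.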