action #68410: repeated alerts about "no data" that are not actionable and recover themselves often enough - openQA Infrastructure (public) - openSUSE Project Management Tool

action #68410

closed

repeated alerts about "no data" that are not actionable and recover themselves often enough

Added by okurz almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: (not set)
Target version: openQA Project (public) - Ready
Start date: 2020-06-24
Due date: (not set)
% Done: (not set)
Estimated time: (not set)

Description

Observation

I repeatedly receive email alerts such as "[No Data] Queue: State (SUSE) alert" and, similarly, "[No Data] HTTP Response alert". I doubt anyone acts on these, as the systems recover automatically.

Problem

We should only have actionable and serious alerts, to prevent alarm fatigue. These non-actionable "no data" alerts should be prevented.

Updated by okurz almost 5 years ago

Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/328

Updated by okurz almost 5 years ago

Status changed from Feedback to Resolved

Merged; maybe the alarms go away now. I hope we have not lost an alert about "the host has completely vanished" with this ;)

Updated by livdywan about 4 years ago

Status changed from Resolved to Feedback

[Alerting] Queue: State (SUSE) alert

Error message

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29": dial tcp [::1]:8086: connect: connection refused

The alert went back to OK within the next minute.

I'm tentatively re-opening the ticket - feel free to create a new one if you think it's a new issue.
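
For readability, the URL-encoded request in the error message above can be decoded into the query the alert rule was trying to run. This is only a small sketch using the Python standard library; the URL is copied verbatim from the alert, and nothing here is part of the actual monitoring setup.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL copied verbatim from the error message in the alert email.
url = (
    "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29"
    "+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27"
    "+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29"
)

# parse_qs turns "+" into spaces and resolves the percent-escapes.
params = parse_qs(urlsplit(url).query)
query = params["q"][0]

print("database:", params["db"][0])
print("query:   ", query)
```

The decoded query is `SELECT mean("scheduled") FROM "openqa_jobs" WHERE "url" = 'https://openqa.suse.de' AND time > now() - 1m GROUP BY time(40s) fill(null)`, so the alert failed because the connection to `localhost:8086` was refused, not because the query returned no data.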

Updated by okurz about 4 years ago

Status changed from Feedback to Resolved
Target version set to Ready

The original issue was about "No data" alerts raised whenever hosts are offline. We fixed that by ignoring "no data" and adding a specific "host up" alert that is driven by pinging the corresponding machine from another instance, e.g. openqa.suse.de pings each worker. What you reference looks quite different: it is an actual "Alerting" state, caused by a refused connection. Please create a specific new issue about that.
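
To illustrate the idea of a dedicated "host up" check that runs from another instance, here is a minimal sketch. It is not the configuration from the merge request; the function names `ping_once` and `hosts_down` are hypothetical, and the default check assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds).

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if `host` answers a single ICMP echo request.

    Assumption: a Linux `ping` binary supporting `-c` and `-W` is on PATH.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def hosts_down(
    hosts: Iterable[str], ping: Callable[[str], bool] = ping_once
) -> List[str]:
    """Return the hosts that failed the reachability check, in input order."""
    return [host for host in hosts if not ping(host)]
```

The point of the design is that a missing metric ("no data") is never treated as an alert by itself; only an explicit failed reachability check from a second machine is. The `ping` parameter is injectable so the logic can be exercised without network access, e.g. `hosts_down(["a", "b"], ping=lambda h: h != "b")` returns `["b"]`.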
