action #89275
closed
Flaky: "Incomplete jobs (not restarted) of last 24h alert" and "New incompletes alert" triggered and back to OK
Description
Observation
I noticed these two alerts firing, then reverting to [OK], at 2:32Z:
[No Data] Incomplete jobs (not restarted) of last 24h alert
[Alerting] New incompletes alert
tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29": dial tcp [::1]:8086: connect: connection refused
There is no metric data in either alert, just the error message above.
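URL-decoded, the query the alert runs reads as follows. As a minimal sketch, it could be re-run by hand on openqa-monitor.qa.suse.de, assuming the InfluxDB 1.x influx CLI is available there:
# run the alert's query manually against the telegraf database
influx -database telegraf -execute 'SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql" WHERE time > now() - 1m GROUP BY time(50ms)'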
Suggestion
- Investigate logfiles on openqa-monitor.qa.suse.de
- Find out what's on 8086 and why it refuses connections
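A minimal sketch of such checks on openqa-monitor.qa.suse.de; the unit name influxdb.service and the timestamps are assumptions about the host's setup and the incident time:
# which process is listening on port 8086 (expected: influxd)
ss -tlnp | grep 8086
# InfluxDB journal entries around the time the alerts fired
journalctl -u influxdb.service --since '2021-02-28 03:25' --until '2021-02-28 03:40'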
Updated by okurz almost 4 years ago
- Priority changed from Normal to Urgent
The alert might look flaky because we only look at the incompletes within a certain time period, and there may be little we can do to change that. However, it is important that we find out what these incompletes are. One can simply call openqa-review manually to find out (see the sketch below).
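A minimal sketch of such a manual call; the --host option and the target instance are assumptions, adjust as needed:
# generate the review report, which lists the incomplete jobs
openqa-review --host https://openqa.suse.de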
Updated by mkittler almost 4 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
The graphs of the alerts don't show that the number of (new) incompletes goes over the threshold.
I suspect Grafana simply could not access InfluxDB. Hence it produced the alert, but it is a false alarm.
Telegraf was also restarted at the same time the alert mails were sent (28.02.21 03:32):
systemctl status telegraf.service
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2021-02-28 03:32:27 CET; 1 day 7h ago
That InfluxDB was briefly unavailable would also explain why everything was good again and OK mails were sent in the next minute.
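One way to verify whether InfluxDB itself was restarted at the same time (a sketch; the unit name influxdb.service is an assumption):
# when did each service last enter the active state?
systemctl show -p ActiveEnterTimestamp influxdb.service
systemctl show -p ActiveEnterTimestamp telegraf.service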
Find out what's on 8086 and why it refuses connections
8086 is just the port InfluxDB is listening on (running on openqa-monitor.qa.suse.de).
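A quick way to confirm that InfluxDB currently accepts connections on that port is its /ping health endpoint (sketch, assuming InfluxDB 1.x; a reachable instance answers with 204 No Content):
# expected: "HTTP/1.1 204 No Content"
curl -sS -i http://localhost:8086/ping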
Investigate logfiles on openqa-monitor.qa.suse.de
I had a look into the journal of InfluxDB on openqa-monitor.qa.suse.de, but unfortunately it logs every query, so it is quite cluttered and I couldn't find anything obvious. In fact, it looks like InfluxDB actually responded to other queries within that time frame.
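For future reference, a sketch of cutting down that noise, assuming the per-query lines contain '/query' as in InfluxDB's default HTTP access log format:
# journal entries around the incident, without the query access-log lines
journalctl -u influxdb.service --since '2021-02-28 03:30' --until '2021-02-28 03:35' | grep -v '/query'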
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Since both alerts were effectively "no data" alerts and the issue resolved itself within a minute, I don't think it is worth looking into further, and I also wouldn't know what else to check.
Updated by okurz almost 4 years ago
What could be done is to prevent the check for "No Data" and instead configure it to "keep last state"; we do this for alert checks on workers. However, considering that the data comes from the central web UI telegraf instance, the situation is a bit different. Also, the service restart happened at the time the host rebooted after necessary, automatically applied upgrades, so this is likely to reappear. Hm, I doubt we can configure a longer grace time for "No Data", can we? If not, I suggest configuring "keep last state" here as well (see the sketch below).
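For Grafana's legacy alert definitions this corresponds to the noDataState field, where keep_state keeps the previous alert state instead of firing on missing data. A sketch of locating the current setting in a local clone of salt-states-openqa (the directory is an assumption):
# find alert definitions that set an explicit "No Data" handling
grep -rn '"noDataState"' monitoring/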
Updated by livdywan almost 4 years ago
From 9:40 to 9:47 I saw the "Incomplete jobs (not restarted) of last 24h alert" once again, twice going to [No data] and back to [OK] respectively. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes there should be no alert.
Updated by livdywan almost 4 years ago
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to livdywan
Updated by livdywan almost 4 years ago
cdywan wrote:
From 9:40 to 9:47 I saw the "Incomplete jobs (not restarted) of last 24h alert" once again, twice going to [No data] and back to [OK] respectively. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes there should be no alert.
To clarify: in this specific case the period is 1 minute, and I think we should increase it based on the alerts I'm observing. So 10 minutes is actually what I'm proposing in my MR (see the sketch below).
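As a sketch of what the wider period means for the underlying query (the concrete change in the MR may differ), the alert would then evaluate the last 10 minutes instead of the last minute:
# same query as in the alert, with the window widened from 1m to 10m (assumed)
influx -database telegraf -execute 'SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql" WHERE time > now() - 10m GROUP BY time(50ms)'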
Updated by livdywan almost 4 years ago
From 3:18 to 3:21:
[No Data] Incomplete jobs (not restarted) of last 24h alert
[OK] Incomplete jobs (not restarted) of last 24h alert
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/457
The MR got merged; let's see how well this works.
Updated by livdywan almost 4 years ago
- Status changed from Feedback to Resolved
I've not observed this particular alert recently.