Flaky "Incomplete jobs (not restarted) of last 24h" alert and "New incompletes" alert triggered and back to OK
I noticed these two alerts firing, then reverting to [OK], at 2:32Z:
[No Data] Incomplete jobs (not restarted) of last 24h alert
[Alerting] New incompletes alert
tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29": dial tcp [::1]:8086: connect: connection refused
There is no metric/data in either, just the error message above.
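The query Grafana was trying to run is easier to read once the URL is decoded. A quick sketch using Python's standard library (the URL is copied from the error message above):

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference'
       '%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22'
       '+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29')

# parse_qs decodes percent-escapes and treats '+' as a space
params = parse_qs(urlsplit(url).query)
print(params['q'][0])
# SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql" WHERE time > now() - 1m GROUP BY time(50ms)
```

So the alert only looks at the last minute of data, which matters later in this ticket.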
- Investigate logfiles on openqa-monitor.qa.suse.de
- Find out what's on 8086 and why it refuses connections
- Priority changed from Normal to Urgent
The alert might look flaky because we only look at the incompletes within a certain time period; there is probably not much we can do to change that. However, it is important that we find out what these incompletes are. One can simply run openqa-review manually to find out.
- Status changed from Workable to In Progress
- Assignee set to mkittler
The graphs of the alerts don't show the number of (new) incompletes going over the threshold.
I suspect Grafana simply could not access InfluxDB; hence it produced the alert, but it is a false alarm.
Telegraf was also restarted at the same time the alert mails were sent (28.02.21 03:32):
systemctl status telegraf.service
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2021-02-28 03:32:27 CET; 1 day 7h ago
InfluxDB being briefly unavailable would also explain why everything was good again and OK mails were sent within the next minute.
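The restart time and the alert timestamp line up once time zones are accounted for; a quick check, using the times quoted in the messages above:

```python
from datetime import datetime, timezone, timedelta

# CET is UTC+1 (no DST in effect on 2021-02-28)
cet = timezone(timedelta(hours=1))
restart = datetime(2021, 2, 28, 3, 32, 27, tzinfo=cet)  # telegraf restart, from systemctl
print(restart.astimezone(timezone.utc).strftime('%H:%MZ'))
# 02:32Z, the time the alert mails were sent
```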
Find out what's on 8086 and why it refuses connections
8086 is just the port InfluxDB is listening on (running on openqa-monitor.qa.suse.de).
Investigate logfiles on openqa-monitor.qa.suse.de
I had a look into the journal of InfluxDB on openqa-monitor.qa.suse.de, but unfortunately it logs every query, so it is quite cluttered and I couldn't find anything obvious. In fact, it looks like InfluxDB actually responded to other queries within that time frame.
What could be done is to disable the check for "No Data" and instead configure the alert to "keep last state". We already do this for alert checks on workers. However, considering that this data comes from the central web UI's telegraf instance, the situation is a bit different. The service restart happened while the host was rebooting after necessary, automatically applied upgrades, so this is likely to reappear. I doubt we can configure a longer grace time for "No Data", can we? If not, I suggest configuring "keep last state" here as well.
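In Grafana's legacy dashboard alerting this is controlled per alert in the panel's dashboard JSON. A minimal sketch of the relevant fields (the surrounding alert definition is omitted; exact values would need to match our dashboards):

```json
{
  "alert": {
    "name": "Incomplete jobs (not restarted) of last 24h alert",
    "noDataState": "keep_state",
    "executionErrorState": "keep_state"
  }
}
```

With `keep_state`, a momentary "No Data" or query error (such as the connection refused above) leaves the alert in its previous state instead of firing.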
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to cdywan
From 9:40 to 9:47 I saw the "Incomplete jobs (not restarted) of last 24h" alert once again, twice going to [No data] and back to [OK] respectively. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes, there should be no alert.
To clarify: in this specific case the period is 1 minute, and I think we should increase it based on the alerts I'm observing. So 10 minutes is what I'm proposing in my MR.
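For illustration, in the legacy alerting dashboard JSON the evaluated time range sits in the alert condition's query parameters; a hypothetical fragment of such a change (the actual MR may express this differently):

```json
{
  "conditions": [
    {
      "query": {
        "params": ["A", "10m", "now"]
      }
    }
  ]
}
```

Widening the range from `["1m", "now"]` to `["10m", "now"]` means a single missing data point no longer empties the whole evaluation window.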
- Status changed from In Progress to Feedback
The MR got merged; let's see how well this works.