
action #89275

Flaky Incomplete jobs (not restarted) of last 24h alert and New incompletes alert triggered and back to OK

Added by cdywan 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
cdywan
Target version:
Start date:
2021-03-01
Due date:
% Done:

0%

Estimated time:

Description

Observation

I noticed these two alerts firing, then reverting to [OK], at 2:32Z:

[No Data] Incomplete jobs (not restarted) of last 24h alert

[Alerting] New incompletes alert

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29": dial tcp [::1]:8086: connect: connection refused

There is no Metric/Data in either, just the error message above.
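Decoded, the failing request is a plain InfluxQL query; a quick sketch to decode the `q=` parameter from the error message, using only the Python standard library:

```python
from urllib.parse import unquote_plus

# URL-encoded q= parameter copied from the Grafana error message above
encoded = ("SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29"
           "+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29")

# unquote_plus turns '+' into spaces and decodes %xx escapes
print(unquote_plus(encoded))
```

This yields `SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql" WHERE time > now() - 1m GROUP BY time(50ms)`, i.e. the query itself is unremarkable; the failure is purely at the TCP level (connection refused on port 8086).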

Suggestion

  • Investigate logfiles on openqa-monitor.qa.suse.de
  • Find out what's on 8086 and why it refuses connections

History

#1 Updated by okurz 5 months ago

  • Priority changed from Normal to Urgent

The alert might look flaky because we only look at the incompletes within a certain time period, and we are probably hardly able to change that. However, it's important that we find out what these incompletes are. One can simply call openqa-review manually to find out.

#2 Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The graphs of the alerts don't show that the number of (new) incompletes goes over the threshold.

I suspect Grafana simply could not access InfluxDB. Hence it produced the alert, but it is a false one.

Telegraf was also restarted at the same time the alert mails were sent (28.02.21 03:32):

systemctl status telegraf.service 
‚óŹ telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-02-28 03:32:27 CET; 1 day 7h ago

That InfluxDB was briefly unavailable would also explain why everything was good again and OK mails were sent in the next minute.


Find out what's on 8086 and why it refuses connections

8086 is just the port InfluxDB is listening on (running on openqa-monitor.qa.suse.de).


Investigate logfiles on openqa-monitor.qa.suse.de

I had a look into the journal of InfluxDB on openqa-monitor.qa.suse.de, but unfortunately it logs every query, so it is quite cluttered and I couldn't find anything obvious. In fact, it looks like InfluxDB actually responded to other queries within that time frame.

#3 Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

Since both alerts were effectively "No Data" alerts and the issue resolved itself within a minute, I don't think it is worth looking into further, and I also wouldn't know what else to check.

#4 Updated by okurz 5 months ago

What could be done is to prevent the check for "No Data" and instead configure the alert to "keep last state". We do this for alert checks on workers. However, considering that the data comes from the central web UI telegraf instance, the situation is a bit different here. Also, the service restart happened at the time the host rebooted after necessary, automatically applied upgrades, so this is likely to reappear. Hm, I doubt we can configure a longer grace time for "No Data", can we? If not, I suggest configuring "keep last state" here as well.
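For reference, with Grafana's legacy dashboard alerting this amounts to setting the no-data (and, for connection errors like the one above, the execution-error) handling on the panel's alert definition. A sketch of the relevant fields, assuming the dashboards are provisioned as JSON; the `name` and `frequency` values here are illustrative, while the state values are Grafana's documented options:

```json
{
  "alert": {
    "name": "Incomplete jobs (not restarted) of last 24h alert",
    "frequency": "60s",
    "noDataState": "keep_state",
    "executionErrorState": "keep_state"
  }
}
```

With `keep_state`, a transient gap in the data or a failed query leaves the alert in its previous state instead of firing.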

#5 Updated by cdywan 5 months ago

From 9:40 to 9:47 I saw the Incomplete jobs (not restarted) of last 24h alert once again, twice going to [No Data] and back to [OK]. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes, there should be no alert.

#6 Updated by cdywan 5 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to cdywan

#7 Updated by cdywan 5 months ago

cdywan wrote:

From 9:40 to 9:47 I saw the Incomplete jobs (not restarted) of last 24h alert once again, twice going to [No Data] and back to [OK]. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes, there should be no alert.

To clarify: in this specific case the period is 1 minute, and I think we should increase it based on the alerts I'm observing. So 10 minutes is what I'm proposing in my MR.
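If the change is made at the query level, the sketch below shows the window widened from 1 to 10 minutes (illustrative only; the actual MR may instead adjust the alert rule's evaluation settings):

```sql
-- before: any brief InfluxDB outage leaves the 1-minute window empty ("No Data")
SELECT non_negative_difference(distinct("incompletes_last_24h"))
  FROM "postgresql" WHERE time > now() - 1m GROUP BY time(50ms)

-- after (proposed): a 10-minute window tolerates short outages
SELECT non_negative_difference(distinct("incompletes_last_24h"))
  FROM "postgresql" WHERE time > now() - 10m GROUP BY time(50ms)
```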

#8 Updated by cdywan 5 months ago

From 3:18 to 3:21:

[No Data] Incomplete jobs (not restarted) of last 24h alert
[OK] Incomplete jobs (not restarted) of last 24h alert

#9 Updated by cdywan 5 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/457

The MR got merged; let's see how well this works.

#10 Updated by cdywan 5 months ago

  • Status changed from Feedback to Resolved

I've not observed this particular alert recently.
