action #89275

closed

Flaky Incomplete jobs (not restarted) of last 24h alert and New incompletes alert triggered and back to OK

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2021-03-01
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

I noticed these two alerts firing, then reverting to [OK], at 2:32Z:

[No Data] Incomplete jobs (not restarted) of last 24h alert

[Alerting] New incompletes alert

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29": dial tcp [::1]:8086: connect: connection refused

Neither alert shows any metric/data, just the error message above.
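
For readability, the URL-encoded query in the error message decodes to plain InfluxQL. Below is a minimal sketch of re-issuing the same request manually with curl, assuming the same local InfluxDB endpoint on openqa-monitor.qa.suse.de; a "connection refused" here would reproduce the failure Grafana saw:

# Decoded query from the failing request:
#   SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql"
#   WHERE time > now() - 1m GROUP BY time(50ms)
curl --get 'http://localhost:8086/query' \
  --data-urlencode 'db=telegraf' \
  --data-urlencode 'epoch=s' \
  --data-urlencode 'q=SELECT non_negative_difference(distinct("incompletes_last_24h")) FROM "postgresql" WHERE time > now() - 1m GROUP BY time(50ms)'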

Suggestion

  • Investigate logfiles on openqa-monitor.qa.suse.de
  • Find out what is listening on port 8086 and why it refuses connections
Actions #1

Updated by okurz about 3 years ago

  • Priority changed from Normal to Urgent

The alert might look flaky because we only look at the incompletes within a certain time period, and we may hardly be able to change that. However, it's important that we find out what these incompletes are. One can simply call openqa-review manually to find out.
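
A minimal sketch of listing recent incompletes via the openQA API as an alternative to running openqa-review; the host, the result filter and the limit parameter are assumptions, not taken from this ticket:

# List recent incomplete jobs including their incomplete reason (hypothetical
# host and query parameters; adjust to the instance being monitored)
curl -s 'https://openqa.suse.de/api/v1/jobs?result=incomplete&limit=50' | \
  jq -r '.jobs[] | "\(.id) \(.name): \(.reason // "no reason")"'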

Actions #2

Updated by mkittler about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The graphs of the alerts don't show the number of (new) incompletes exceeding the threshold.

I suspect Grafana simply couldn't access InfluxDB, so it produced a false alert.

Telegraf was also restarted at the same time the alert mails were sent (28.02.21 03:32):

systemctl status telegraf.service 
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-02-28 03:32:27 CET; 1 day 7h ago

InfluxDB being briefly unavailable would also explain why everything was good again and OK mails were sent in the next minute.
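
A minimal sketch for checking when InfluxDB itself last (re)started relative to the alert, assuming the unit is named influxdb.service on openqa-monitor.qa.suse.de:

# If influxdb.service entered the "active" state around 03:32 as well, that
# would explain the brief "connection refused" seen by Grafana.
systemctl show -p ActiveEnterTimestamp influxdb.service telegraf.service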


Find out what is listening on port 8086 and why it refuses connections

8086 is just the port InfluxDB (running on openqa-monitor.qa.suse.de) is listening on.
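
A minimal sketch for double-checking what is listening on port 8086 and whether it answers, using InfluxDB's /ping health endpoint (the commands are assumptions, not taken from this ticket):

# Which process is bound to port 8086 (expected: influxd)
ss -tlnp | grep 8086
# InfluxDB health check; an HTTP 204 response means it is reachable
curl -i http://localhost:8086/ping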


Investigate logfiles on openqa-monitor.qa.suse.de

I had a look into the journal of InfluxDB on openqa-monitor.qa.suse.de but unfortunately it logs every query, so it is quite cluttered and I couldn't find anything obvious. In fact, it looks like InfluxDB actually responded to other queries within that time frame.
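
A minimal sketch for narrowing the journal down to the relevant minute while dropping the per-query log noise; the grep pattern is an assumption about the log format:

journalctl -u influxdb.service --since '2021-02-28 03:31' --until '2021-02-28 03:34' | grep -v 'GET /query'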

Actions #3

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

Since both alerts were effectively "no data" alerts and the issue resolved itself within a minute, I don't think it is worth looking into further, and I also wouldn't know what else to check.

Actions #4

Updated by okurz about 3 years ago

What could be done is to prevent the check from alerting on "No Data" and instead configure it to "keep last state". We do this for alert checks on workers. However, considering that this data comes from the central web UI telegraf instance, the situation is a bit different. Still, the service restart happened when the host rebooted after necessary, automatically applied upgrades, so this is likely to reappear. Hm, I doubt we can configure a longer grace time for "No Data", can we? If not, I suggest configuring "keep last state" here as well.
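
A minimal sketch for checking which "No Data" behaviour the provisioned panels currently use, assuming legacy Grafana dashboard alerting where each alert carries a noDataState field; the dashboard file name is hypothetical:

# Valid values in legacy Grafana alerting: no_data, alerting, keep_state, ok
jq -r '.panels[] | select(.alert) | "\(.title): \(.alert.noDataState)"' dashboard-webui.json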

Actions #5

Updated by livdywan about 3 years ago

From 9:40 to 9:47 I saw the Incomplete jobs (not restarted) of last 24h alert once again, twice going to [No Data] and back to [OK]. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes, there should be no alert.

Actions #6

Updated by livdywan about 3 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to livdywan
Actions #7

Updated by livdywan about 3 years ago

cdywan wrote:

From 9:40 to 9:47 I saw the Incomplete jobs (not restarted) of last 24h alert once again, twice going to [No Data] and back to [OK]. This is why I called it flaky. If it's expected to resolve itself within e.g. 5 minutes, there should be no alert.

To clarify: in this specific case the period is 1 minute, and I think we should increase it based on the alerts I'm observing. So 10 minutes is what I'm proposing in my MR.
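
A minimal sketch for inspecting the evaluation window the alert conditions currently use, assuming legacy Grafana alerting where the window is the second query parameter of each condition; the dashboard file name is hypothetical and the actual change lives in the MR:

# Prints e.g. ["A","1m","now"]; the proposal is to use 10m instead of 1m
jq -r '.panels[] | select(.alert) | "\(.title): \(.alert.conditions[0].query.params)"' dashboard-webui.json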

Actions #8

Updated by livdywan about 3 years ago

From 3:18 to 3:21:

[No Data] Incomplete jobs (not restarted) of last 24h alert
[OK] Incomplete jobs (not restarted) of last 24h alert
Actions #9

Updated by livdywan about 3 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/457

The MR got merged, let's see how well this works.

Actions #10

Updated by livdywan about 3 years ago

  • Status changed from Feedback to Resolved

I've not observed this particular alert recently.
