action #125303

closed

prevent confusing "no data" alerts size:M

Added by mkittler over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-03-02
Due date:
2023-04-07
% Done:

0%

Estimated time:
Tags:

Description

Observation

We used "no data" as the alert trigger for our "host up" alerts. This caused confusion after switching to the new unified alerting system in Grafana because we thought that no data was provided by telegraf while in reality it was a valid alert.

Acceptance criteria

  • AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

Suggestions

  • Wait for a Grafana 9.1 update so we can provision alerts from files
  • Change the "host up"-alert from using "average_response_ms" to "result_code"
  • Crosscheck if we already have a solution for telegraf not being able to push data to influxdb
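The first two suggestions combined could be sketched as a file-provisioned alert rule once a Grafana version >= 9.1 is available. All names, UIDs, folders and thresholds below are illustrative assumptions (based on telegraf's ping plugin fields), not our actual configuration:

```yaml
# Hypothetical provisioning-file sketch for unified alerting (Grafana >= 9.1).
apiVersion: 1
groups:
  - orgId: 1
    name: host-up
    folder: Infrastructure          # placeholder folder
    interval: 1m
    rules:
      - uid: host-up-example        # placeholder UID
        title: example-worker host up
        condition: B                # alert when the threshold expression fires
        for: 5m                     # pending period to avoid alerts on short blips
        noDataState: NoData         # keep "no data" distinguishable from a failed ping
        data:
          - refId: A
            datasourceUid: influxdb # placeholder datasource UID
            model:
              # telegraf's ping plugin: result_code is 0 on success, 1 on failure
              query: SELECT last("result_code") FROM "ping" WHERE "url" = 'example-worker'
          - refId: B
            datasourceUid: "__expr__"   # server-side expression
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [0] }
```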

Related issues (2: 0 open, 2 closed)

Related to openQA Infrastructure - action #122845: Migrate our Grafana setup to "unified alerting" (Resolved, nicksinger, 2023-01-09)

Blocked by openQA Infrastructure - action #125642: Manage "unified alerting" via salt size:M (Resolved, mkittler, 2023-01-09)
Actions #1

Updated by osukup over 1 year ago

Going through the journal on ow14 - no issues present; it looks like monitoring lost connection to ow14 and nothing else.

Actions #2

Updated by okurz over 1 year ago

  • Target version set to Ready
Actions #3

Updated by okurz over 1 year ago

  • Subject changed from openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23 to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23)
  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to nicksinger

@nicksinger please track this ticket as being blocked by #122845 which you are working on right now

Actions #4

Updated by okurz over 1 year ago

  • Related to action #122845: Migrate our Grafana setup to "unified alerting" added
Actions #5

Updated by okurz over 1 year ago

  • Priority changed from Normal to High

Ok, it looks like we received multiple "no data" messages also over the weekend, so this is more pressing.

Actions #6

Updated by mkittler over 1 year ago

  • Status changed from Blocked to New

Not blocked by #122845. It is one particular aspect of it, though.

Actions #7

Updated by mkittler over 1 year ago

  • Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M
  • Status changed from New to Workable
Actions #8

Updated by okurz over 1 year ago

  • Priority changed from High to Urgent

Lots of emails …

Actions #9

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Priority changed from Urgent to High

I added a silence matching alertname=DatasourceNoData. That should help us for the time being until we understand the situation better.
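Such a silence can also be created programmatically through Grafana's embedded Alertmanager API. The sketch below only assembles and prints the request payload; the host, token and two-day duration are assumptions for illustration, and the actual curl call is left commented out:

```shell
#!/bin/sh
# Sketch: build the JSON body for a silence matching alertname=DatasourceNoData.
STARTS_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# GNU date syntax; fall back to the start timestamp on other platforms
ENDS_AT=$(date -u -d '+2 days' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || echo "$STARTS_AT")
PAYLOAD=$(cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "DatasourceNoData", "isRegex": false}
  ],
  "startsAt": "$STARTS_AT",
  "endsAt": "$ENDS_AT",
  "createdBy": "okurz",
  "comment": "silence generic no-data alerts until understood better"
}
EOF
)
echo "$PAYLOAD"
# To actually create the silence (placeholder host and token):
# curl -sS -H "Authorization: Bearer $GRAFANA_TOKEN" -H 'Content-Type: application/json' \
#   -d "$PAYLOAD" https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences
```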

Actions #10

Updated by okurz over 1 year ago

Discussed with nicksinger. The general silencing does not make sense as the "host up" alert was actually the only real alert where we deliberately relied on no data, because "average_response_ms" never returns a value if there is no response. However, it looks like we never challenged that query design, which has been in place since 2019 (salt-states-openqa commit 5ae5356). So I deleted the generic silence again. Also, openqa-piworker just reappeared after dheidler fixed the network config, so I also unsilenced the specific alert about openqa-piworker.

Looking back into my email archive over the past days I could only find "FIRING.DatasourceNoData.*host up alert", which are the good ones. So it seems we never had an unintended message about NoData. Still, we can try to improve the alert by switching to "result_code" checking, which apparently yields 0 in case of a successful ping response and 1 otherwise. We changed the alert but, as we already know, the alert configuration is not saved in the exported json file. So that is something to continue … or manually change all ping alerts to use "last of packets_received, alert if max is below 1".
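The "last of packets_received, alert if max is below 1" variant would roughly correspond to a query along these lines (measurement and field names assume telegraf's ping plugin defaults; the host value is a placeholder):

```sql
-- Hypothetical InfluxQL for the reworked ping alert; telegraf's ping plugin
-- writes one "packets_received" field per probe into the "ping" measurement.
SELECT last("packets_received")
FROM "ping"
WHERE "url" = 'example-worker' AND time > now() - 5m
GROUP BY time(1m) fill(null)
-- Grafana alert condition: fire when max() of this series is below 1
```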

Actions #11

Updated by nicksinger over 1 year ago

  • Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M to prevent confusing "no data" alerts size:M
  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Priority changed from High to Low
Actions #12

Updated by nicksinger over 1 year ago

  • Blocked by action #125642: Manage "unified alerting" via salt size:M added
Actions #13

Updated by okurz over 1 year ago

  • Status changed from Blocked to New
  • Priority changed from Low to High

An additional problem seems to be that the weekly reboot of machines can trigger a lot of alerts. Maybe the pending period was not properly migrated and needs to be increased again.

Actions #14

Updated by okurz over 1 year ago

grafana 9.3.6 was built in https://build.opensuse.org/package/show/server:monitoring/grafana but not yet published so we can monitor http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/?P=grafana* and upgrade as soon as published.
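The repository listing can be monitored with a small helper; the function below just extracts grafana RPM file names from a directory listing read on stdin, so the network-dependent curl call stays commented out:

```shell
#!/bin/sh
# Extract grafana-*.rpm file names from an OBS download directory listing (HTML on stdin).
list_grafana_rpms() {
    grep -o 'grafana-[0-9][^"<>]*\.rpm' | sort -u
}
# Example usage (requires network access):
# curl -s 'http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/' | list_grafana_rpms
```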

Actions #15

Updated by livdywan over 1 year ago

  • Status changed from New to In Progress
  • Assignee changed from nicksinger to livdywan

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

Actions #16

Updated by livdywan over 1 year ago

cdywan wrote:

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

For the record: the way we write the config file, it would be fully overridden, which is why I decided against side-stepping salt and prepared an MR instead.

Actions #17

Updated by openqa_review over 1 year ago

  • Due date set to 2023-03-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

I assume we're good here. Unassigning for now; somebody else may pick it up for the next step.

Actions #19

Updated by okurz over 1 year ago

The rollback had the effect that the alert list in Grafana apparently no longer shows "unified alerting" alerts and we receive emails in the old format again, i.e. with subjects like "[Alerting]" and "[Ok]" instead of "[FIRING]" and "[RESOLVED]". However, https://monitor.qa.suse.de/alerting/list shows an error: "Failed to load Grafana rules state: 404 from rule state endpoint. Perhaps ruler API is not enabled". That might also come from the automatic upgrade to grafana 9.3.6 which happened overnight, so I suggest doing a web search for that error message, enabling a simple boolean option in the grafana config, restarting the service and hopefully having it fixed.

Actions #20

Updated by mkittler over 1 year ago

I'm not sure where you're seeing that error. It seems to be related to Loki, which I don't think we even use. I don't think there's a simple boolean flag to enable (unless we would actually use Loki).

Actions #21

Updated by nicksinger about 1 year ago

  • Assignee set to nicksinger
Actions #22

Updated by nicksinger about 1 year ago

  • Status changed from Workable to In Progress
Actions #23

Updated by okurz about 1 year ago

  • Due date changed from 2023-03-28 to 2023-04-07

discussed in daily infra call 2023-03-30

Actions #24

Updated by livdywan about 1 year ago

  • Status changed from In Progress to Resolved
  • AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

I double-checked in the team chat. We still see "no data" alerts, e.g. today there was one *Queue: State (SUSE)* alert, but not for the availability of hosts. So the AC is fulfilled, and we can come up with follow-up tickets as needed as part of the regular alert handling.
