action #125303

closed

prevent confusing "no data" alerts size:M

Added by mkittler over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-03-02
Due date:
2023-04-07
% Done:

0%

Estimated time:
Tags:

Description

Observation

We used "no data" as the alert trigger for our "host up" alerts. This caused confusion after switching to the new unified alerting system in Grafana because we thought that no data was provided by telegraf while in reality it was a valid alert.

Acceptance criteria

  • AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

Suggestions

  • Wait for a Grafana 9.1 update so we can provision alerts from files
  • Change the "host up"-alert from using "average_response_ms" to "result_code"
  • Crosscheck if we already have a solution for telegraf not being able to push data to influxdb
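The first two suggestions combined could be sketched as a file-provisioned alert rule once a Grafana version >= 9.1 is available. All names, UIDs, folders and thresholds below are illustrative assumptions (based on telegraf's ping plugin fields), not our actual configuration:

```yaml
# Hypothetical provisioning-file sketch for unified alerting (Grafana >= 9.1).
apiVersion: 1
groups:
  - orgId: 1
    name: host-up
    folder: Infrastructure          # placeholder folder
    interval: 1m
    rules:
      - uid: host-up-example        # placeholder UID
        title: example-worker host up
        condition: B                # alert when the threshold expression fires
        for: 5m                     # pending period to avoid alerts on short blips
        noDataState: NoData         # keep "no data" distinguishable from a failed ping
        data:
          - refId: A
            datasourceUid: influxdb # placeholder datasource UID
            model:
              # telegraf's ping plugin: result_code is 0 on success, 1 on failure
              query: SELECT last("result_code") FROM "ping" WHERE "url" = 'example-worker'
          - refId: B
            datasourceUid: "__expr__"   # server-side expression
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [0] }
```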

Related issues (2: 0 open, 2 closed)

Related to openQA Infrastructure - action #122845: Migrate our Grafana setup to "unified alerting" (Resolved, nicksinger, 2023-01-09)

Blocked by openQA Infrastructure - action #125642: Manage "unified alerting" via salt size:M (Resolved, mkittler, 2023-01-09)
Actions #1

Updated by osukup over 1 year ago

Going through the journal on ow14 - no issues present; it looks like monitoring lost connection to ow14 and nothing else.

Actions #2

Updated by okurz over 1 year ago

  • Target version set to Ready
Actions #3

Updated by okurz over 1 year ago

  • Subject changed from openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23 to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23)
  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to nicksinger

@nicksinger please track this ticket as being blocked by #122845 which you are working on right now

Actions #4

Updated by okurz over 1 year ago

  • Related to action #122845: Migrate our Grafana setup to "unified alerting" added
Actions #5

Updated by okurz over 1 year ago

  • Priority changed from Normal to High

Ok, it looks like we received multiple "no data" messages also over the weekend, so this is more pressing.

Actions #6

Updated by mkittler over 1 year ago

  • Status changed from Blocked to New

Not blocked by #122845. It is one particular aspect of it, though.

Actions #7

Updated by mkittler over 1 year ago

  • Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M
  • Status changed from New to Workable
Actions #8

Updated by okurz over 1 year ago

  • Priority changed from High to Urgent

Lots of emails …

Actions #9

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Priority changed from Urgent to High

I added a silence matching alertname=DatasourceNoData. That should help us for the time being until we understand the situation better.
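Such a silence can also be created programmatically through Grafana's embedded Alertmanager API. The sketch below only assembles and prints the request payload; the host, token and two-day duration are assumptions for illustration, and the actual curl call is left commented out:

```shell
#!/bin/sh
# Sketch: build the JSON body for a silence matching alertname=DatasourceNoData.
STARTS_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# GNU date syntax; fall back to the start timestamp on other platforms
ENDS_AT=$(date -u -d '+2 days' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || echo "$STARTS_AT")
PAYLOAD=$(cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "DatasourceNoData", "isRegex": false}
  ],
  "startsAt": "$STARTS_AT",
  "endsAt": "$ENDS_AT",
  "createdBy": "okurz",
  "comment": "silence generic no-data alerts until understood better"
}
EOF
)
echo "$PAYLOAD"
# To actually create the silence (placeholder host and token):
# curl -sS -H "Authorization: Bearer $GRAFANA_TOKEN" -H 'Content-Type: application/json' \
#   -d "$PAYLOAD" https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences
```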

Actions #10

Updated by okurz over 1 year ago

Discussed with nicksinger. The general silencing does not make sense as the "host up" alert was actually the only real alert where we deliberately relied on no data, because "average_response_ms" never returns a value if there is no response. However, it looks like we never challenged that query design, which has been in place since 2019 (salt-states-openqa commit 5ae5356). So I deleted the generic silence again. Also, openqa-piworker just reappeared after dheidler fixed the network config, so I also unsilenced the specific alert about openqa-piworker.

Looking back into my email archive over the past days I could only find "FIRING.DatasourceNoData.*host up alert", which are the good ones. So it seems we never had an unintended message about NoData. Still, we can try to improve the alert by switching to "result_code" checking, which apparently yields 0 in case of a successful ping response and 1 otherwise. We changed the alert but, as we already know, the alert configuration is not saved in the exported json file. So that is something to continue … or manually change all ping alerts to use "last of packets_received, alert if max is below 1".
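The "last of packets_received, alert if max is below 1" variant would roughly correspond to a query along these lines (measurement and field names assume telegraf's ping plugin defaults; the host value is a placeholder):

```sql
-- Hypothetical InfluxQL for the reworked ping alert; telegraf's ping plugin
-- writes one "packets_received" field per probe into the "ping" measurement.
SELECT last("packets_received")
FROM "ping"
WHERE "url" = 'example-worker' AND time > now() - 5m
GROUP BY time(1m) fill(null)
-- Grafana alert condition: fire when max() of this series is below 1
```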

Actions #11

Updated by nicksinger over 1 year ago

  • Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M to prevent confusing "no data" alerts size:M
  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Priority changed from High to Low
Actions #12

Updated by nicksinger over 1 year ago

  • Blocked by action #125642: Manage "unified alerting" via salt size:M added
Actions #13

Updated by okurz over 1 year ago

  • Status changed from Blocked to New
  • Priority changed from Low to High

An additional problem seems to be that the weekly reboot of machines can trigger a lot of alerts. Maybe the pending period was not properly migrated and needs to be increased again.

Actions #14

Updated by okurz over 1 year ago

grafana 9.3.6 was built in https://build.opensuse.org/package/show/server:monitoring/grafana but not yet published so we can monitor http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/?P=grafana* and upgrade as soon as published.
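The repository listing can be monitored with a small helper; the function below just extracts grafana RPM file names from a directory listing read on stdin, so the network-dependent curl call stays commented out:

```shell
#!/bin/sh
# Extract grafana-*.rpm file names from an OBS download directory listing (HTML on stdin).
list_grafana_rpms() {
    grep -o 'grafana-[0-9][^"<>]*\.rpm' | sort -u
}
# Example usage (requires network access):
# curl -s 'http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/' | list_grafana_rpms
```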

Actions #15

Updated by livdywan over 1 year ago

  • Status changed from New to In Progress
  • Assignee changed from nicksinger to livdywan

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

Actions #16

Updated by livdywan over 1 year ago

cdywan wrote:

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

For the record: the way we write the config file, it would be fully overridden, which is why I decided against side-stepping salt and prepared an MR instead.

Actions #17

Updated by openqa_review over 1 year ago

  • Due date set to 2023-03-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by livdywan over 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

I assume we're good here. Unassigning for now; somebody else may pick it up for the next step.

Actions #19

Updated by okurz over 1 year ago

The rollback had the effect that the alert list in Grafana apparently no longer shows "unified alerting" alerts and we receive emails in the old format again, i.e. with subjects like "[Alerting]" and "[Ok]" instead of "[FIRING]" and "[RESOLVED]". However, https://monitor.qa.suse.de/alerting/list shows an error: "Failed to load Grafana rules state: 404 from rule state endpoint. Perhaps ruler API is not enabled". That might also come from the automatic upgrade to grafana 9.3.6 which happened overnight, so I suggest doing a web search for that error message, enabling a simple boolean option in the grafana config, restarting the service and hopefully having it fixed.

Actions #20

Updated by mkittler over 1 year ago

I'm not sure where you're seeing that error. It seems to be related to Loki, which I don't think we even use. I don't think there's a simple boolean flag to enable (unless we would actually use Loki).

Actions #21

Updated by nicksinger about 1 year ago

  • Assignee set to nicksinger
Actions #22

Updated by nicksinger about 1 year ago

  • Status changed from Workable to In Progress
Actions #23

Updated by okurz about 1 year ago

  • Due date changed from 2023-03-28 to 2023-04-07

discussed in daily infra call 2023-03-30

Actions #24

Updated by livdywan about 1 year ago

  • Status changed from In Progress to Resolved
  • AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

I double-checked in the team chat. We still see "no data" alerts, e.g. today there was one *Queue: State (SUSE)* alert, but not for the availability of hosts. So the AC is fulfilled, and we can come up with follow-up tickets as needed as part of the regular alert handling.
