action #125303: prevent confusing "no data" alerts size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #125303

closed

prevent confusing "no data" alerts size:M

Added by mkittler almost 2 years ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

nicksinger

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-03-02

Due date:

2023-04-07

% Done:

Estimated time:

Tags:

alert, infra

Description

Observation

We used "no data" as the alert trigger for our "host up" alerts. This caused confusion after switching to the new unified alerting system in Grafana because we thought that Telegraf was providing no data, while in reality it was a valid alert.

Acceptance criteria

AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

Suggestions

Wait for a Grafana 9.1 update so we can provision alerts from files
Change the "host up"-alert from using "average_response_ms" to "result_code"
Crosscheck if we already have a solution for telegraf not being able to push data to influxdb
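
A minimal sketch of the second suggestion, assuming the ping check reports a "result_code" of 0 on success and a non-zero value otherwise (as noted later in this ticket). The `Sample` type and the `evaluate_host_up` function are hypothetical names used only for illustration; they are not part of Grafana or Telegraf. The point is that an empty evaluation window stays a separate state instead of doubling as the "host down" signal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class State(Enum):
    OK = "ok"
    ALERTING = "alerting"
    NO_DATA = "no_data"


@dataclass
class Sample:
    # One ping measurement; average_response_ms is absent when there was no reply.
    result_code: int
    average_response_ms: Optional[float] = None


def evaluate_host_up(samples: Sequence[Sample]) -> State:
    """Classify a host from the samples seen in the evaluation window."""
    if not samples:
        # Nothing arrived at all: a collection problem, not a host-down verdict.
        return State.NO_DATA
    # Use the most recent sample; a non-zero code means the ping failed.
    return State.OK if samples[-1].result_code == 0 else State.ALERTING
```

With this split, a failed ping still produces a sample (with a non-zero code), so "no data" would only indicate that the metrics pipeline itself delivered nothing.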

Related issues 2 (0 open — 2 closed)

Related to openQA Infrastructure (public) - action #122845: Migrate our Grafana setup to "unified alerting"

Resolved

nicksinger

2023-01-09

Blocked by openQA Infrastructure (public) - action #125642: Manage "unified alerting" via salt size:M

Resolved

mkittler

2023-01-09

History

Updated by osukup almost 2 years ago

Going through the journal on ow14: no issues present. It looks like monitoring lost the connection to ow14 and nothing else.

Updated by okurz almost 2 years ago

Target version set to Ready

Updated by okurz almost 2 years ago

Subject changed from openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23 to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23)
Description updated (diff)
Status changed from New to Blocked
Assignee set to nicksinger

@nicksinger please track this ticket as being blocked by #122845 which you are working on right now

Updated by okurz almost 2 years ago

Related to action #122845: Migrate our Grafana setup to "unified alerting" added

Updated by okurz almost 2 years ago

Priority changed from Normal to High

OK, it looks like we also received multiple "no data" messages over the weekend, so this is more pressing.

Updated by mkittler almost 2 years ago

Status changed from Blocked to New

Not blocked by #122845. It is one particular aspect of it, though.

Updated by mkittler almost 2 years ago

Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) to ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M
Status changed from New to Workable

Updated by okurz almost 2 years ago

Priority changed from High to Urgent

Lots of emails ...

Updated by okurz almost 2 years ago

Description updated (diff)
Priority changed from Urgent to High

I added a silence matching alertname=DatasourceNoData. That should help us for the time being until we understand better.

#10

Updated by okurz almost 2 years ago

Discussed with nicksinger. The general silencing does not make sense because the "host up" alert was actually the only real alert where we would care about missing data: "average_response_ms" never returns a value if there is no response. However, it looks like we never challenged that query design, which has been in place since 2019 in salt-states-openqa commit 5ae5356. So I deleted the generic silence again. Also, openqa-piworker just reappeared after dheidler fixed the network config, so I also unsilenced the specific alert about openqa-piworker.

Looking back into my email archive over the past days, I could only find "FIRING.DatasourceNoData.*host up alert" messages, which are the good ones. So it seems we never had an unintended message about NoData. Still, we can try to improve the alert by switching to checking "result_code", which apparently yields 0 in case of a successful ping response and 1 otherwise. We changed the alert but, as we already know, the alert configuration is not saved in the exported JSON file. So that is something to continue … or manually change all ping alerts to use "last of packets_received, alert if max is below 1"
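
One possible reading of the "packets_received" rule, sketched under the assumption that the alert looks at the values seen in the evaluation window and fires when even the largest of them is below 1. The function name `ping_alert_fires` is hypothetical, and how Grafana combines the "last" reducer with the "max" condition may differ from this simplification.

```python
from typing import Optional, Sequence


def ping_alert_fires(packets_received: Sequence[int]) -> Optional[bool]:
    """Return True if the alert should fire, False if not, None when there is no data."""
    if not packets_received:
        # An empty window is reported separately instead of being treated as "down".
        return None
    # Fire only if no evaluation in the window saw a single reply packet.
    return max(packets_received) < 1
```

Unlike "average_response_ms", a count of received packets is still a number (zero) when the host does not answer, so an unreachable host shows up as a real value rather than as missing data.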

#11

Updated by nicksinger almost 2 years ago

Subject changed from ensure no "no data" alerts after migrating to unified alerting in grafana (was: openqaworker14: host up alert firing from 11:50 to 14:30 on 02.03.23) size:M to prevent confusing "no data" alerts size:M
Description updated (diff)
Status changed from Workable to Blocked
Priority changed from High to Low

#12

Updated by nicksinger almost 2 years ago

Blocked by action #125642: Manage "unified alerting" via salt size:M added

#13

Updated by okurz almost 2 years ago

Status changed from Blocked to New
Priority changed from Low to High

An additional problem seems to be that the weekly reboot of machines can trigger a lot of alerts. Maybe the pending period was not properly migrated and needs to be increased again.
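
To illustrate why a longer pending period would help with reboots, here is a small sketch. It assumes the rule is evaluated at a fixed interval and only fires once the condition has been breached continuously for at least the pending period; `is_firing` and its parameters are hypothetical names, not Grafana settings.

```python
from typing import Sequence


def is_firing(breaches: Sequence[bool], interval_s: int, pending_s: int) -> bool:
    """True if the condition has been breached continuously for at least pending_s."""
    consecutive = 0
    # Count the uninterrupted run of breaches ending at the latest evaluation.
    for breached in reversed(breaches):
        if not breached:
            break
        consecutive += 1
    if consecutive == 0:
        return False
    # Time elapsed between the first and the latest breach in the current run.
    return (consecutive - 1) * interval_s >= pending_s
```

For example, a host that is unreachable for three consecutive one-minute evaluations during a reboot would fire with a pending period of 120 seconds but stay quiet with one of 300 seconds.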

#14

Updated by okurz almost 2 years ago

grafana 9.3.6 was built in https://build.opensuse.org/package/show/server:monitoring/grafana but not yet published so we can monitor http://download.opensuse.org/repositories/server:/monitoring/15.4/x86_64/?P=grafana* and upgrade as soon as published.

#15

Updated by livdywan almost 2 years ago

Status changed from New to In Progress
Assignee changed from nicksinger to livdywan

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

#16

Updated by livdywan almost 2 years ago

cdywan wrote:

We decided to roll back the alerts for now. This can be done by adjusting the config file, so I'll take care of that step.

For the record, the way we write the config file means it will be fully overwritten, which is why I decided against side-stepping salt and prepared an MR for it.

#17

Updated by openqa_review almost 2 years ago

Due date set to 2023-03-28

Setting due date based on mean cycle time of SUSE QE Tools

#18

Updated by livdywan almost 2 years ago

Status changed from In Progress to Workable
Assignee deleted (~~livdywan~~)

I assume we're good here. Unassigning for now. Somebody else may pick up for the next step.

#19

Updated by okurz almost 2 years ago

The rollback apparently had the effect that the alert list on Grafana no longer shows "unified alerting" alerts and we receive emails in the old format again, with subjects like "[Alerting]" and "[Ok]" instead of "[FIRING]" and "[RESOLVED]". However, https://monitor.qa.suse.de/alerting/list shows the error "Failed to load Grafana rules state: 404 from rule state endpoint. Perhaps ruler API is not enabled". That might also come from the automatic upgrade to Grafana 9.3.6, which happened overnight, so I suggest doing a web search for that error message, enabling a simple boolean option in the Grafana config, restarting the service and hopefully having it fixed.

#20

Updated by mkittler almost 2 years ago

I'm not sure where you're seeing that error. It seems to be related to Loki, which I don't think we even use. I don't think there's a simple boolean flag to enable (unless we actually used Loki).

#21

Updated by nicksinger over 1 year ago

Assignee set to nicksinger

#22

Updated by nicksinger over 1 year ago

Status changed from Workable to In Progress

#23

Updated by okurz over 1 year ago

Due date changed from 2023-03-28 to 2023-04-07

discussed in daily infra call 2023-03-30

#24

Updated by livdywan over 1 year ago

Status changed from In Progress to Resolved

AC1: We don't rely on "no data"-triggers for other purposes (e.g. host up, etc)

I double-checked in the team chat. We still see "no data" alerts, e.g. today there was one *Queue: State (SUSE) alert*, but not for the availability of hosts. So the AC is fulfilled, and we can come up with follow-up tickets as needed as part of the regular alert handling.
