action #133892

closed

[alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M

Added by tinita over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-08-07
Due date:
2023-08-25
% Done:

0%

Estimated time:

Description

Observation

At Sat, 05 Aug 2023 19:55:08 +0200 we got an alert from Grafana.

1 firing alert instance
 GROUPED BY 

hostname=arm-worker2

 1 firing instances

Firing [stats.openqa-monitor.qa.suse.de]
arm-worker2: host up alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=1 
Labels
alertname
arm-worker2: host up alert
grafana_folder
openQA
hostname
arm-worker2
rule_uid
host_up_alert_arm-worker2
type
worker
Annotations
message
No data received for pings from worker to central host, likely host is down (or split network). See https://progress.opensuse.org/issues/71098 for details

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1

I can't see anything in the panel, though.

According to the alert history, the rule has been in the alerting state since August 3, but the panel shows nothing for that period either.

Problem

The machine "arm-worker2" no longer exists and should not exist, just like "d105" and similarly named machines, which were temporary DHCP lease names. Apparently alerts for machines that are not in salt are no longer removed automatically.

Acceptance criteria

  • AC1: Alert rules for hosts that are not salt-controlled are removed automatically (or semi-automatically)
  • AC2: No alerts are firing anymore for hosts that no longer exist

Suggestions

  • Find out where the alert rule https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1, which is still marked as "provisioned", comes from. It may be something on monitor.qa.suse.de in /etc/telegraf/ or somewhere in gitlab.suse.de/openqa/salt-states-openqa
  • Try to find a way to automatically clean up alert rule references for hosts that no longer exist, for example when a salt high state is applied in the pipeline of gitlab.suse.de/openqa/salt-states-openqa. If full automation is not possible, provide instructions on how to handle this in the README at gitlab.suse.de/openqa/salt-states-openqa#openqa-salt-states
  • Clean up the current state
  • Ensure that the corresponding alert rules are removed
  • Ensure that no related alerts are still firing
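
The cleanup suggested above needs a way to tell which alert rules belong to hosts that salt no longer knows about. The following is a minimal sketch of that comparison, not the implementation used in salt-states-openqa. It assumes that rule UIDs follow the pattern `<kind>_alert_<hostname>`, as `host_up_alert_arm-worker2` does, and that the list of salt-controlled hosts is obtained elsewhere. The function names `hostname_from_rule_uid` and `find_stale_rule_uids` are hypothetical.

```python
from typing import Iterable, List, Optional


def hostname_from_rule_uid(uid: str) -> Optional[str]:
    """Return the hostname part of a rule UID, or None if it does not match.

    Assumption: UIDs look like '<kind>_alert_<hostname>', e.g.
    'host_up_alert_arm-worker2'. UIDs without '_alert_' are ignored.
    """
    _prefix, separator, host = uid.rpartition("_alert_")
    return host if separator and host else None


def find_stale_rule_uids(rule_uids: Iterable[str], salt_hosts: Iterable[str]) -> List[str]:
    """List rule UIDs whose hostname is not among the salt-controlled hosts."""
    known_hosts = set(salt_hosts)
    stale = []
    for uid in rule_uids:
        host = hostname_from_rule_uid(uid)
        if host is not None and host not in known_hosts:
            stale.append(uid)
    return sorted(stale)


if __name__ == "__main__":
    # Illustrative data only; 'worker1' is a made-up salt-controlled host.
    rules = ["host_up_alert_arm-worker2", "host_up_alert_worker1", "host_up_alert_d105"]
    print(find_stale_rule_uids(rules, ["worker1"]))
```

The result is only a candidate list. Whether the rules can then be removed automatically depends on how Grafana treats rules it still considers provisioned.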

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, 2023-06-29)

Actions #1

Updated by tinita over 1 year ago

  • Description updated (diff)
Actions #2

Updated by tinita over 1 year ago

  • Description updated (diff)
Actions #3

Updated by okurz over 1 year ago

  • Tags set to infra, alert, prg2, reactive work
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #4

Updated by okurz over 1 year ago

  • Related to action #132137: Setup new PRG2 openQA worker for osd size:M added
Actions #5

Updated by mkittler over 1 year ago

  • Subject changed from [alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker to [alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler over 1 year ago

  • Status changed from Workable to In Progress
Actions #9

Updated by openqa_review over 1 year ago

  • Due date set to 2023-08-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

The MR has been merged.

Since Grafana still assumed those alerts were provisioned I followed the steps in our README and ended up deleting them via:

sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
  delete from alert_rule where uid like '%_arm-worker2%';
  delete from alert_rule_version where rule_uid like '%_arm-worker2';
  delete from provenance_type where record_key like '%_arm-worker2';
  delete from annotation where text like '%_arm-worker2%';
 "
sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
  delete from alert_rule where uid regexp '.*_alert_d\d\d\d';
  delete from alert_rule_version where rule_uid regexp '.*_alert_d\d\d\d';
  delete from provenance_type where record_key regexp '.*_alert_d\d\d\d';
  delete from annotation where text regexp '.*_d\d\d\d';
 "

I've also deleted the dashboards. There the deletion was already in place, and Grafana noticed that the dashboards are no longer provisioned, so they could be deleted via the web UI.
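
Before running delete statements like the sqlite3 ones above, it can help to count what a pattern would match. The following is a minimal sketch of such a dry run, using only the table and column names that appear in those statements. The schema built by `make_demo_db` is a hypothetical, heavily simplified stand-in for the real grafana.db, and all function names are illustrative. In SQL `LIKE`, `_` matches any single character, so the sketch escapes it to match a literal underscore. It also uses a trailing `%` for every table, which is an assumption.

```python
import sqlite3

# (table, column) pairs taken from the delete statements in this comment.
TARGETS = [
    ("alert_rule", "uid"),
    ("alert_rule_version", "rule_uid"),
    ("provenance_type", "record_key"),
    ("annotation", "text"),
]


def like_pattern_for_host(host: str) -> str:
    """Build a LIKE pattern matching '_<host>' with a literal underscore."""
    escaped = host.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%\\_" + escaped + "%"


def count_matches(conn: sqlite3.Connection, host: str) -> dict:
    """Count rows per table that reference the host, without deleting anything."""
    pattern = like_pattern_for_host(host)
    counts = {}
    for table, column in TARGETS:
        query = f"SELECT COUNT(*) FROM {table} WHERE {column} LIKE ? ESCAPE '\\'"
        counts[table] = conn.execute(query, (pattern,)).fetchone()[0]
    return counts


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical one-column tables; the real grafana.db has many more columns."""
    conn = sqlite3.connect(":memory:")
    for table, column in TARGETS:
        conn.execute(f"CREATE TABLE {table} ({column} TEXT)")
    conn.executemany(
        "INSERT INTO alert_rule VALUES (?)",
        [
            ("host_up_alert_arm-worker2",),
            ("host_up_alert_worker1",),
            ("host_up_alertXarm-worker2",),  # only an unescaped '_' would match this
        ],
    )
    conn.execute(
        "INSERT INTO annotation VALUES (?)",
        ("alert for host_up_alert_arm-worker2 fired",),
    )
    return conn


if __name__ == "__main__":
    print(count_matches(make_demo_db(), "arm-worker2"))
```

Running the same counts against a copy of the real database, rather than the live file, would keep the check read-only.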

Actions #11

Updated by mkittler over 1 year ago

I guess there's nothing we can do about that limitation in Grafana except document a workaround. We already have documentation, but I've just created another MR with an example of how to remove multiple alerts in one go: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/940

I've also simplified the silence again: https://stats.openqa-monitor.qa.suse.de/alerting/silence/444a55a0-8fce-43fd-8a85-50c817d0f46d/edit?alertmanager=grafana

Actions #12

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

The documentation change has been merged. With that I'm considering this resolved.
