action #133892
closed
[alert] arm-worker2 (arm-worker2: host up alert openQA host_up_alert_arm-worker2 worker size:M
Start date:
2023-08-07
Due date:
2023-08-25
% Done:
0%
Estimated time:
Tags:
Description
Observation
At Sat, 05 Aug 2023 19:55:08 +0200 we got an alert from Grafana.
1 firing alert instance
GROUPED BY
hostname=arm-worker2
1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
arm-worker2: host up alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
B0=1
Labels
alertname=arm-worker2: host up alert
grafana_folder=openQA
hostname=arm-worker2
rule_uid=host_up_alert_arm-worker2
type=worker
Annotations
message=No data received for pings from worker to central host, likely host is down (or split network). See https://progress.opensuse.org/issues/71098 for details
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1
I can't see any data in the panel, though.
According to the alert history, the rule has been in the alerting state since August 3, yet the panel shows nothing for that period either.
Problem
The machine "arm-worker2" no longer exists and is not supposed to exist, just like "d105" and similarly named machines, which were temporary DHCP lease names. Apparently alert rules for machines that are not in salt are no longer removed automatically.
Acceptance criteria
- AC1: Alert rules for hosts that are not salt-controlled are removed automatically (or semi-automatically)
- AC2: No alerts are firing anymore for hosts that no longer exist
Suggestions
- Find out where the alert rule https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_arm-worker2/view?orgId=1, which is still marked as "provisioned", comes from. It may originate from something on monitor.qa.suse.de in /etc/telegraf/ or from somewhere in gitlab.suse.de/openqa/salt-states-openqa
- Try to find a way to automatically clean up alert rule references for hosts that no longer exist, for example when a salt high state is applied in the pipeline of gitlab.suse.de/openqa/salt-states-openqa. If full automation is not possible, provide instructions on how to handle this in the gitlab.suse.de/openqa/salt-states-openqa#openqa-salt-states README
- Clean up the current state
  - Ensure that the corresponding alert rules are removed
  - Ensure that no related alerts are still firing
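The cleanup check suggested above could be sketched as follows. This is only an illustration, not the actual salt-states-openqa implementation: it assumes that host-up rule UIDs follow the `host_up_alert_<hostname>` pattern seen in this ticket, and that the list of rule UIDs and the list of salt-controlled hosts have already been collected by some other means. The function name `stale_host_rules` and the host name `worker-a` are hypothetical.

```python
from typing import Iterable, List

# Prefix taken from the rule_uid in this ticket (host_up_alert_arm-worker2).
RULE_PREFIX = "host_up_alert_"


def stale_host_rules(rule_uids: Iterable[str], salt_hosts: Iterable[str]) -> List[str]:
    """Return host-up rule UIDs whose host is not among the salt-controlled hosts."""
    known = set(salt_hosts)
    stale = []
    for uid in rule_uids:
        if not uid.startswith(RULE_PREFIX):
            continue  # not a host-up rule, leave it alone
        host = uid[len(RULE_PREFIX):]
        if host not in known:
            stale.append(uid)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical inputs: "worker-a" stands in for a host that salt still manages.
    rules = ["host_up_alert_arm-worker2", "host_up_alert_worker-a", "some_other_rule"]
    hosts = ["worker-a"]
    for uid in stale_host_rules(rules, hosts):
        print("stale alert rule:", uid)
```

Run in a pipeline after the high state is applied, a non-empty result could either trigger the removal of the listed rules or fail the job, so that someone handles them manually according to the README.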