action #162380
closed 2024-06-15 osd not accessible - causing false alerts for other hosts size:S
Status:
Rejected
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-17
Due date:
% Done:
0%
Estimated time:
Description
Observation
I received alerts about "[FIRING:1] DatasourceNoData Salt (000000001 A Nk0h5mB4z [openqa] openqaworker-arm-1 online (long-time) alert)" starting a few minutes after #162332 happened. The cause is not openqaworker-arm-1 itself but OSD being down and not delivering data. In other cases we already treat "NoData" as acceptable. Possibly we forgot to do the same for the automatic recovery dashboard and need to apply it here as well.
Acceptance criteria
- AC1: Outages of any other host do not trigger "no data" alerts for openqaworker-arm-1
- AC2: Long-term unhandled outages of openqaworker-arm-1 still trigger alerts
Suggestions
- Look into the git history of https://gitlab.suse.de/openqa/salt-states-openqa for when we changed "no data" to "OK", e.g. see f7180fe. Then identify the corresponding alert definitions related to the automatic recovery and adjust them accordingly
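The adjustment suggested above could look roughly like the following sketch. It assumes the alert rules are available as a list of rule groups in the shape of Grafana's alert rule provisioning format, where each rule has a `title` and a `noDataState` field. The helper name `treat_no_data_as_ok`, the example group and the second rule title are hypothetical; the actual layout of the definitions in salt-states-openqa may differ.

```python
from copy import deepcopy


def treat_no_data_as_ok(groups, title_substring):
    """Return a copy of `groups` where matching rules treat "no data" as OK.

    `groups` is assumed to be a list of rule groups, each a dict with a
    "rules" list, as in Grafana's alert rule provisioning format.
    Only rules whose title contains `title_substring` are changed.
    """
    updated = deepcopy(groups)  # leave the caller's data untouched
    for group in updated:
        for rule in group.get("rules", []):
            if title_substring in rule.get("title", ""):
                rule["noDataState"] = "OK"
    return updated


# Hypothetical input, reduced to the fields relevant here.
example_groups = [
    {
        "name": "openqaworker-arm-1",
        "rules": [
            {
                "title": "openqaworker-arm-1 online (long-time) alert",
                "noDataState": "NoData",
            },
            {
                "title": "some unrelated alert",  # hypothetical second rule
                "noDataState": "NoData",
            },
        ],
    }
]

patched = treat_no_data_as_ok(example_groups, "online (long-time) alert")
for rule in patched[0]["rules"]:
    print(rule["title"], "->", rule["noDataState"])
```

This only covers AC1. Whether AC2 still holds depends on the alert query reporting an offline openqaworker-arm-1 as data rather than as missing data, which would need to be checked against the real query.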