action #162380
closed 2024-06-15 osd not accessible - causing false alerts for other hosts size:S
Status:
Rejected
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-17
Due date:
% Done:
0%
Estimated time:
Description
Observation
I received alerts about "[FIRING:1] DatasourceNoData Salt (000000001 A Nk0h5mB4z [openqa] openqaworker-arm-1 online (long-time) alert)" starting a few minutes after #162332 happened. The cause is not openqaworker-arm-1 itself: OSD was down and therefore not delivering data. In other cases we already treat "NoData" as acceptable. We possibly forgot to do the same for the automatic recovery dashboard and need to apply it here as well.
Acceptance criteria
- AC1: Outages of any other host do not trigger "no data" alerts for openqaworker-arm-1
- AC2: Long-term unhandled outages of openqaworker-arm-1 still trigger alerts
Suggestions
- Look through the git history of https://gitlab.suse.de/openqa/salt-states-openqa for when we changed "no data" to "OK", e.g. see f7180fe. Then identify the corresponding alert definitions related to the automatic recovery and adjust them accordingly
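To help find the alert definitions that still need adjusting, here is a minimal sketch that lists rules whose no-data handling is not "OK". It assumes the rules are available in the shape of Grafana's alert rule provisioning format (`groups` -> `rules` -> `noDataState`). Whether salt-states-openqa stores them exactly like this is an assumption. The function name `rules_not_ok_on_no_data` and the second rule in the example data are hypothetical.

```python
import json


def rules_not_ok_on_no_data(provisioning: dict) -> list[str]:
    """Return the titles of alert rules whose noDataState is not "OK"."""
    flagged = []
    for group in provisioning.get("groups", []):
        for rule in group.get("rules", []):
            # Assumption: a rule without an explicit noDataState behaves
            # like "NoData", so it is flagged as well.
            state = rule.get("noDataState", "NoData")
            if state != "OK":
                flagged.append(rule.get("title", "<untitled>"))
    return flagged


# Illustrative data only; the second rule is made up for contrast.
EXAMPLE = json.loads("""
{
  "groups": [
    {
      "name": "Salt",
      "rules": [
        {"title": "openqaworker-arm-1 online (long-time) alert",
         "noDataState": "NoData"},
        {"title": "hypothetical rule already adjusted",
         "noDataState": "OK"}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    for title in rules_not_ok_on_no_data(EXAMPLE):
        print(title)
```

Running the same check over the real alert definitions would show which automatic recovery alerts can still fire "DatasourceNoData" when OSD itself is unreachable. AC2 would still need a separate check that long outages of openqaworker-arm-1 raise an alert.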
Updated by okurz 6 months ago
- Copied from action #162332: 2024-06-15 osd not accessible size:M added
Updated by livdywan 4 months ago
- Subject changed from 2024-06-15 osd not accessible - causing false alerts about "[FIRING:1] DatasourceNoData Salt .*openqaworker-arm-1 online (long-time) alert" to 2024-06-15 osd not accessible - causing false alerts for other hosts size:S
- Status changed from New to Workable