action #151588
closed: [potential-regression] Our salt node up check in osd-deployment never fails size:M
Start date: 2023-11-28
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Motivation
In #150983-5 we realized that our "check all salt nodes are up" step never fails, as visible in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2017020#L38 . Didn't we have that failing in the past? We likely also don't want that step to fail just because any random salt-controlled host is down. But what ensures that we notice when systems are no longer connected over salt?
Acceptance criteria
- AC1: We are alerted from somewhere if a host is not responsive, for example if it does not respond over salt
- AC2: We are also alerted if a host is reachable over ping but does not respond over salt, e.g. stuck in boot
- AC3: We have a workflow for handling salt nodes which are down, so that OSD deployment is unblocked
Suggestions
- Review the git log in https://gitlab.suse.de/openqa/osd-deployment or the ticket history for "check all salt nodes are up" to find out whether it was an intentional design change that the check for salt nodes never fails
- If we really do not have any better alerting in grafana and we definitely need to do that in gitlab CI, then either make the check fatal OR document clearly that its logs are to be checked manually
- Consider using https://build.opensuse.org/package/show/openSUSE:Factory/retry instead of the manual for loop
- Ensure that somewhere we have an alert or check that actually fails if not all salt nodes are reachable, e.g. crosscheck the "host up" check and verify its operation by artificially triggering an alert
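As an illustration of the "make it fatal" and "retry instead of a manual for loop" suggestions, here is a minimal Python sketch, not the actual osd-deployment implementation. The names `unresponsive_nodes`, `exit_code`, and the `ping` callable are hypothetical; `ping` is assumed to return a mapping of salt minion ID to whether the minion responded (for example, built from the output of `salt '*' test.ping`). The point is only that the check retries a few times and then yields a non-zero exit code if any node is still down.

```python
import time
from typing import Callable, List, Mapping


def unresponsive_nodes(
    ping: Callable[[], Mapping[str, bool]],
    attempts: int = 3,
    delay: float = 0.0,
) -> List[str]:
    """Return the sorted IDs of nodes still not responding after all attempts.

    `ping` is a hypothetical callable returning {minion_id: responded}.
    """
    down: List[str] = []
    for attempt in range(attempts):
        result = ping()
        down = sorted(node for node, ok in result.items() if not ok)
        if not down:
            return []  # every node answered, no need to retry
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next attempt
    return down


def exit_code(ping: Callable[[], Mapping[str, bool]], attempts: int = 3) -> int:
    """Return 1 (fatal for a CI step) if any node stays unresponsive, else 0."""
    down = unresponsive_nodes(ping, attempts=attempts)
    if down:
        print("salt nodes not responding: " + ", ".join(down))
        return 1
    return 0
```

A CI step could pass the result of `exit_code(...)` to `sys.exit`, so that the job fails instead of only logging the unresponsive nodes.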
Out of scope
- We shouldn't care about the one salt bug which caused "no response" as opposed to "not connected". It is unlikely to reappear and surprise us
Updated by okurz about 1 year ago
- Copied from action #150983: CPU Load and usage alert for openQA workers size:S added
Updated by okurz 11 months ago
- Copied to action #154627: [potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M added
Updated by okurz 8 months ago
- Related to action #159270: openqaworker-arm-1 is Unreachable size:S added