action #151588

Updated by okurz 6 months ago

## Motivation 
 In #150983-5 we realized that our "check all salt nodes are up" step never fails, as visible in . Didn't we have that failing in the past? We likely also don't want that step to fail whenever any random salt-controlled host is down. But what *does* ensure that we notice when systems are no longer connected over salt? 

 ## Acceptance criteria 
 * **AC1:** We are alerted from somewhere if a host is not responsive, for example if it does not respond over salt 
 * **AC2:** We are also alerted if a host is reachable over ping but does not respond over salt, e.g. because it is stuck in boot 
 * **AC3:** We have a workflow for handling salt nodes which are down, so that OSD deployment can be unblocked 
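
 To illustrate the difference between AC1 and AC2, here is a minimal sketch of how hosts could be sorted into the two alert categories. It assumes that we already have three lists from elsewhere (the hosts we expect, the hosts answering ping, the hosts answering over salt). The function name `classify_hosts` and the category names are hypothetical and not taken from our existing setup.

```python
def classify_hosts(expected, pingable, salt_responsive):
    """Sort expected hosts into the alert categories of AC1 and AC2.

    All three arguments are iterables of host names. Hosts which are
    not in `expected` are ignored.
    """
    expected = set(expected)
    pingable = set(pingable) & expected
    salt_ok = set(salt_responsive) & expected
    return {
        # AC1: neither ping nor salt gets an answer
        "unreachable": sorted(expected - pingable - salt_ok),
        # AC2: the host answers ping but not salt, e.g. stuck in boot
        "ping_only": sorted(pingable - salt_ok),
    }
```

 Both categories should lead to an alert. Keeping them apart would only help whoever handles the alert to decide if the machine is powered off or stuck.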

 ## Suggestions 
 * Review the git log in or the ticket history about "check all salt nodes are up" to find out whether it was an intentional design change that the check for salt nodes never fails 
   * If we really do not have any better alerting in grafana and we definitely need to do that in gitlab CI then either make the check fatal OR document clearly that its logs need to be checked manually 
 * Consider using instead of the manual for loop 
 * Ensure that *somewhere* we have an alert or check that actually fails if not all salt nodes are reachable 

     e.g. crosscheck the "host up" check and verify its operation by artificially triggering an alert 
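
 As one possible shape for a check that actually fails, here is a minimal sketch. It assumes that it runs on the salt master and that `salt-run manage.status --out=json` prints a JSON object with the keys `up` and `down`. The helper names and the idea of passing hosts to ignore as arguments are hypothetical, not what the current CI step does.

```python
import json
import subprocess
import sys


def unreachable_minions(status, ignored=()):
    """Return the sorted minions reported as down, minus the ignored ones."""
    return sorted(set(status.get("down", [])) - set(ignored))


def main(argv):
    # Assumption: executed on the salt master, where the runner reports
    # minions grouped under "up" and "down".
    result = subprocess.run(
        ["salt-run", "manage.status", "--out=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    status = json.loads(result.stdout)
    # Every argument after the script name is a host we knowingly ignore,
    # e.g. a machine that is down for maintenance.
    down = unreachable_minions(status, ignored=argv[1:])
    if down:
        print("salt minions not responding: " + ", ".join(down), file=sys.stderr)
        return 1
    print("all expected salt minions are up")
    return 0
```

 A CI step or monitoring check would call `sys.exit(main(sys.argv))`, so that a non-empty list of down minions yields a non-zero exit code. The explicit ignore list is meant as one way to not fail on any random host while still failing on unexpected ones.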

 ## Out of scope 
 * We shouldn't care about the one salt bug which caused "no response" as opposed to "not connected". It is unlikely to reappear and surprise us 

 ## This is where Marius can add more