coordination #78206

Updated by okurz 6 months ago

## Observation

On 2020-11-18 a power outage in Nbg caused the network and many services to shut down. o3 started just fine automatically. shows only "first-test-vm", not more. openqa-monitor and backup were down as well as schort-server; these were changed to start automatically. openqaworker1 was powered off and was restarted over IPMI; might this have been a deliberate action by the IT team? Are there further things that we can improve?

## Problems

* *DONE* shows only "first-test-vm", not more
* *DONE* openqa-monitor and backup were down as well as schort-server -> changed to start automatically
* *DONE* wrong IPMI configuration for openqaworker2 ->
* *DONE* IPMI information for grenache-1 missing completely in ->
* openqaworker1 was "power off", restarted over IPMI, needs automatic power-on enabled -> #80542
* *DONE* powerqaworker-qam-1 not reachable via IPMI -> #80544
* openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over IPMI, need automatic power-on enabled
* qa-power8-5 IPMI unstable
* as well as did not automatically start after powerup
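
For the machines found "power off" after the outage, the check and the "automatic on" setting could be scripted roughly as below. This is only a sketch under assumptions: it uses the standard `ipmitool` subcommands `chassis power status` and `chassis policy always-on`, the helper names, the `-ipmi` BMC host name suffix and the credentials are hypothetical, and not every BMC (in particular on the POWER machines) necessarily supports the power restore policy.

```python
import subprocess

# Machines named in this ticket as found powered off after the outage.
HOSTS = ["openqaworker1", "openqaworker-power8", "qa-power8-4", "qa-power8-5"]


def ipmi_command(bmc_host, user, password, *args):
    """Build an ipmitool argv list for a remote BMC (IPMI over LAN)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, *args]


def power_status_cmd(bmc_host, user, password):
    """Command that queries whether the chassis is powered on or off."""
    return ipmi_command(bmc_host, user, password, "chassis", "power", "status")


def always_on_cmd(bmc_host, user, password):
    """Command that asks the BMC to power the machine on when AC returns."""
    return ipmi_command(bmc_host, user, password, "chassis", "policy", "always-on")


def run(cmd, dry_run=True):
    """Return the command line as text (dry run) or execute it and return stdout."""
    if dry_run:
        return " ".join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    for host in HOSTS:
        # Hypothetical naming convention and placeholder credentials.
        bmc = f"{host}-ipmi.example.org"
        print(run(power_status_cmd(bmc, "USER", "PASSWORD")))
        print(run(always_on_cmd(bmc, "USER", "PASSWORD")))
```

With `dry_run=True` (the default here) the script only prints the command lines, so they can be reviewed before anything is sent to a BMC.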

## Suggestions

* Conduct "fire drills": at least reboot the qsf-cluster and other machines more often. I think this supports the idea from okurz, which we implemented, to have automatic reboots of machines as we do within the OSD infrastructure -> #80540