coordination #78206

Updated by livdywan about 4 years ago

## Observation 

 On 2020-11-18 a power outage happened in Nbg, which caused the network and many services to shut down. o3 started just fine automatically. qsf-cluster.qa.suse.de shows only "first-test-vm", not more. openqa-monitor and backup were down, as well as schort-server; these were changed to start automatically. openqaworker1 was powered off and was restarted over IPMI; might this have been a deliberate action by the IT team? Are there further things that we can improve? 


 ## Problems 

 * *DONE* qsf-cluster.qa.suse.de shows only "first-test-vm", not more 
 * *DONE* openqa-monitor and backup were down as well as schort-server -> changed to start automatically 
 * *DONE* wrong IPMI configuration for openqaworker2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/275 
 * openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were powered off and were restarted over IPMI; automatic power-on needs to be enabled 
 * qa-power8-5 IPMI unstable 
 * powerqaworker-qam-1 and powerqaworker-qa not reachable via IPMI 
 * IPMI information for grenache-1 missing completely in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/276 
 * grenache.qa as well as grenache-1.qa did not automatically start after powerup 
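
 The "powered off, restarted over IPMI" items above could be checked in a scripted way after an outage. The following is only a minimal sketch, assuming `ipmitool` is installed and the BMCs accept the `lanplus` interface; the helper names, host and credentials are hypothetical placeholders and are not taken from the salt pillars. Whether a given BMC (in particular on the POWER8 machines) honors `chassis policy always-on` is an assumption that would need to be verified per machine.

```python
import subprocess


def ipmi_command(host, user, password, *args):
    """Build an ipmitool argument list for a remote BMC (hypothetical helper)."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *args]


def parse_power_status(output):
    """Interpret the output of 'ipmitool chassis power status'.

    Returns True for 'Chassis Power is on', False for 'Chassis Power is off'.
    """
    text = output.strip().lower()
    if text.endswith("is on"):
        return True
    if text.endswith("is off"):
        return False
    raise ValueError(f"unexpected ipmitool output: {output!r}")


def ensure_powered_on(host, user, password):
    """Query the power state and power the machine on if it is off.

    Also requests the 'always-on' power restore policy so that the machine
    comes back by itself after the next outage (assumes the BMC supports it).
    """
    status = subprocess.run(
        ipmi_command(host, user, password, "chassis", "power", "status"),
        capture_output=True, text=True, check=True,
    )
    if not parse_power_status(status.stdout):
        subprocess.run(
            ipmi_command(host, user, password, "chassis", "power", "on"),
            check=True,
        )
    subprocess.run(
        ipmi_command(host, user, password, "chassis", "policy", "always-on"),
        check=True,
    )
```

 Calling `ensure_powered_on` once per worker with its real IPMI address and credentials would cover both the immediate recovery and the "automatic power-on" follow-up; machines with unstable or unreachable IPMI would still need manual attention.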


 ## Suggestions 

 * Conduct "fire drills": at the very least, reboot qsf-cluster and other machines more often. I think this supports the idea from okurz, which we implemented, to have automatic reboots of machines as we do within the OSD infrastructure
