openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[epic] 2020-11-18 nbg power outage aftermath
On 2020-11-18 a power outage happened in Nuremberg which caused the network and many services to shut down. o3 started up again automatically just fine. qsf-cluster.qa.suse.de shows only "first-test-vm", nothing more. openqa-monitor and backup are down, as is schort-server; changed to start automatically. openqaworker1 was powered off and restarted over IPMI; this might have been a deliberate action by the IT team? What else can we improve?
- DONE qsf-cluster.qa.suse.de shows only "first-test-vm", nothing more
- DONE openqa-monitor and backup were down as well as schort-server -> changed to start automatically
- DONE wrong IPMI configuration for openqaworker2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/275
- DONE IPMI information for grenache-1 missing completely in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/276
- DONE openqaworker1 was "power off", restarted over IPMI, needs automatic power-on enabled -> #80542
- DONE powerqaworker-qam-1 not reachable via IPMI -> #80544
- DONE qa-power8-5 ipmi unstable -> #80544
- DONE openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam, grenache.qa as well as grenache-1.qa were "power off", restarted over IPMI, need automatic power-on enabled -> #92185
- Conduct "fire drills", at least rebooting qsf-cluster and other machines more often. I think this supports the idea from okurz, which we already implemented, of automatic reboots of machines as we do within the OSD infrastructure -> #80540
- Description updated (diff)
- Status changed from Feedback to In Progress
- Priority changed from Normal to Urgent
To check and correct the power status of all OSD machines I ran
sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p' <(curl -s https://gitlab.suse.de/openqa/salt-pillars-openqa/-/raw/master/openqa/workerconf.sls) and executed the commands
Some machines were not online, and "qa-power8-5.qa.suse.de", "fsp1-powerqaworker-qam.qa.suse.de" and "openqaworker-arm-3-ipmi.suse.de" (already known) could not be reached at all. I triggered the "power on" command for all of them. Surprisingly, this worked for "qa-power8-5.qa.suse.de" as well.
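To illustrate what the sed extraction above does, here is a minimal sketch run against a hypothetical workerconf.sls-style entry (the hostname and credentials below are made up; the real entries live in the salt pillars repository):

```shell
#!/bin/sh
# Hypothetical example line in the style of workerconf.sls;
# the real file contains one such "serial:" entry per worker.
line='serial: `ipmitool -I lanplus -H openqaworker2-ipmi.suse.de -U ADMIN -P secret sol activate`'

# The sed expression rewrites each "... sol activate" serial-console
# command into the corresponding "... power status" query.
printf '%s\n' "$line" \
    | sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p'
# → ipmitool -I lanplus -H openqaworker2-ipmi.suse.de -U ADMIN -P secret power status
```

The resulting commands can then be executed one by one, or piped into `sh` once reviewed.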
If you take a close look at the above list you will notice that "openqaworker1" is mentioned twice but "openqaworker2" not at all. I have updated the description with the current state.
- openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over IPMI, need automatic power-on enabled
What does this mean? Automatic restarts? Via GitLab CI? Something else?
No. This means that the firmware/BIOS/UEFI is configured to automatically power the machine back on when power returns, same as on any consumer PC.
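On machines with a BMC, that power-restore policy can also be set remotely via IPMI with `ipmitool chassis policy always-on`. A sketch, as a dry run that only prints the commands instead of executing them (the hostnames and credentials are hypothetical; the real list comes from workerconf.sls):

```shell
#!/bin/sh
# Hypothetical worker BMC hostnames; the real ones are in workerconf.sls.
hosts='openqaworker1-ipmi.suse.de powerqaworker-qam-1-ipmi.suse.de'

for host in $hosts; do
    # "chassis policy always-on" tells the BMC to power the machine
    # back on automatically when mains power returns.
    # Echoed instead of executed so the sketch is runnable anywhere.
    echo ipmitool -I lanplus -H "$host" -U ADMIN -P secret chassis policy always-on
done
```

Dropping the `echo` would apply the policy for real; `ipmitool ... chassis status` shows the currently configured "Power Restore Policy".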
s390zp18 is still offline, so I disabled those parts of grenache for the weekend - https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/465690c0f597fb3007efc7dedc30f4134b975a75
s390x hosts were re-enabled with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/281
malbec.arch and qa-power8-5 seem to be down. I expect this is actually handled elsewhere, but many people seem to be allergic to tickets ;)
- Tracker changed from action to coordination
- Subject changed from 2020-11-18 nbg power outage aftermath to [epic] 2020-11-18 nbg power outage aftermath
- Description updated (diff)
- Status changed from Feedback to Blocked
- Parent task set to #80142
turned into epic with subtasks