coordination #78206
closed
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[epic] 2020-11-18 nbg power outage aftermath
Added by okurz about 4 years ago.
Updated over 3 years ago.
Estimated time:
(Total: 0.00 h)
Description
Observation
On 2020-11-18 a power outage in Nbg caused the network and many services to shut down. o3 started automatically without problems. qsf-cluster.qa.suse.de shows only "first-test-vm" and nothing more. openqa-monitor and backup were down, as was schort-server; these have been changed to start automatically. openqaworker1 was powered off and was restarted over IPMI. Might this have been a deliberate action by the IT team? What further things can we improve?
Problems
- DONE qsf-cluster.qa.suse.de shows only "first-test-vm", not more
- DONE openqa-monitor and backup were down as well as schort-server -> changed to start automatically
- DONE wrong IPMI configuration for openqaworker2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/275
- DONE IPMI information for grenache-1 missing completely in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/276
- DONE openqaworker1 was "power off", restarted over IPMI, needs automatic power-on enabled -> #80542
- DONE powerqaworker-qam-1 not reachable via IPMI -> #80544
- DONE qa-power8-5 IPMI unstable -> #80544
- DONE openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam, grenache.qa as well as grenache-1.qa were "power off", restarted over IPMI, need automatic power-on enabled -> #92185
Suggestions
- Conduct "fire drills", or at least reboot qsf-cluster and other machines more often. I think this supports the idea from okurz, which we implemented, to have automatic reboots of machines within the OSD infrastructure -> #80540
- Status changed from New to Feedback
- Assignee set to okurz
waiting for more information, ideas, problems, feedback from others.
- Description updated (diff)
- Status changed from Feedback to In Progress
- Priority changed from Normal to Urgent
To check and correct the power status of all OSD machines I generated the IPMI "power status" commands from the "serial" entries in workerconf.sls with
sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p' <(curl -s https://gitlab.suse.de/openqa/salt-pillars-openqa/-/raw/master/openqa/workerconf.sls)
and executed the commands
So some machines were not online, and "qa-power8-5.qa.suse.de", "fsp1-powerqaworker-qam.qa.suse.de" and "openqaworker-arm-3-ipmi.suse.de" (already known) could not be reached at all. I triggered the "power on" command for all of them. Surprisingly, this worked for "qa-power8-5.qa.suse.de" as well.
If you take a close look at the above list you will notice that "openqaworker1" is mentioned twice but "openqaworker2" not at all. I have updated the description with the current state.
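For anyone repeating this check, the one-liner can be wrapped in a small function that takes the power action as a parameter. This is only a sketch: it assumes that each "serial" entry in workerconf.sls is a backtick-quoted ipmitool call ending in "sol activate", as the sed expression above implies. The function name, the sample host and the credentials are made up for illustration, and `[^[:space:]]*` stands in for the GNU-only `\S*`.

```shell
#!/bin/sh
# Sketch: read workerconf-style text on stdin and print one
# "<ipmitool call> power <action>" line per "serial" SOL entry.
# The input layout is an assumption derived from the sed expression in this ticket.
ipmi_power_cmds() {
    action="${1:-status}"   # e.g. "status" or "on"
    sed -n 's/^.*serial:.*`\(.*\) -H \([^[:space:]]*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power '"$action"'/p'
}

# Hypothetical sample entry; host name and credentials are placeholders.
sample='    serial: `ipmitool -I lanplus -H example-worker-ipmi.suse.de -U user -P secret sol activate`'
printf '%s\n' "$sample" | ipmi_power_cmds status
```

Against the real file this would be used as `curl -s <raw workerconf.sls URL> | ipmi_power_cmds status`. It is worth reviewing the generated lines before piping them to `sh`, and only then switching the action to `on`.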
- Description updated (diff)
- Priority changed from Urgent to High
All machines except the already problematic ones "powerqaworker-qa" and "openqaworker-arm-3" are back up.
- Status changed from In Progress to Feedback
waiting for feedback on my MR and notes from others
okurz wrote:
- openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled
What does this mean? Automatic restarts? Via GitLab CI? Something else?
- Description updated (diff)
- Related to action #78218: [openQA][worker] Almost all openQA workers become offline added
cdywan wrote:
okurz wrote:
- openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled
What does this mean? Automatic restarts? Via GitLab CI? Something else?
No. This means that the firmware/BIOS/UEFI is configured to automatically power on the machine if power returns. Same as on any consumer PC.
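On machines with a BMC, the same setting is often reachable remotely as the IPMI "power restore policy" (`ipmitool chassis policy always-on`), without entering the firmware setup. Whether the BMC or FSP of each machine listed here honours it is an assumption that would need checking per host. A minimal dry-run sketch follows; the function name, host name and the `IPMI_USER`/`IPMI_PASSWORD` variables are hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) the ipmitool call that sets the power restore
# policy for one host. Credentials are left as literal variable references
# in the output so that no secret ends up in logs.
power_restore_cmd() {
    host="$1"
    policy="${2:-always-on}"
    case "$policy" in
        always-on|previous|always-off) ;;   # policy names accepted by "ipmitool chassis policy"
        *) echo "unsupported policy: $policy" >&2; return 1 ;;
    esac
    printf 'ipmitool -I lanplus -H %s -U "$IPMI_USER" -P "$IPMI_PASSWORD" chassis policy %s\n' "$host" "$policy"
}

# Hypothetical host name, for illustration only.
power_restore_cmd example-worker-ipmi.suse.de
```

Printing the command instead of running it keeps the sketch safe to try; the current setting can be read back with `ipmitool ... chassis status`, which reports the power restore policy.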
- Tracker changed from action to coordination
- Subject changed from 2020-11-18 nbg power outage aftermath to [epic] 2020-11-18 nbg power outage aftermath
- Description updated (diff)
- Status changed from Feedback to Blocked
- Parent task set to #80142
turned into epic with subtasks
- Description updated (diff)
- Description updated (diff)
- Description updated (diff)
- Description updated (diff)
- Status changed from Blocked to Resolved
All done, I am happy with what we have now :)