coordination #78206

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] 2020-11-18 nbg power outage aftermath

Added by okurz 11 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-11-27
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

On 2020-11-18 a power outage happened in Nbg which caused the network and many services to shut down. o3 started just fine automatically. qsf-cluster.qa.suse.de shows only "first-test-vm", nothing more. openqa-monitor and backup are down, as well as schort-server; these were changed to start automatically. openqaworker1 was powered off and was restarted over IPMI; this might have been a deliberate action by the IT team. Are there further things that we can improve?

Problems

Suggestions

  • Conduct "fire drills", or at least reboot qsf-cluster and other machines more often. I think this supports the idea from okurz, which we implemented, to have automatic reboots of machines as we have within the OSD infrastructure -> #80540

Subtasks

action #80540: idea: Conduct "power outage drills", e.g. once every half-year? | Rejected | okurz

action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) | Resolved | okurz

action #80544: Ensure that IPMI for powerqaworker-qam works reliably | Resolved | okurz

action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage | Resolved | okurz


Related issues

Related to openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline | Resolved | 2020-11-19

History

#1 Updated by okurz 11 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz

waiting for more information, ideas, problems, feedback from others.

#2 Updated by okurz 11 months ago

  • Description updated (diff)
  • Status changed from Feedback to In Progress
  • Priority changed from Normal to Urgent

To check and correct the power status of all OSD machines I did

sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p' <(curl -s https://gitlab.suse.de/openqa/salt-pillars-openqa/-/raw/master/openqa/workerconf.sls)

and executed the commands.

So some machines are not online, and "qa-power8-5.qa.suse.de", "fsp1-powerqaworker-qam.qa.suse.de" and "openqaworker-arm-3-ipmi.suse.de" (already known) could not be reached at all. I triggered the "power on" command for all of them. Surprisingly, this worked for "qa-power8-5.qa.suse.de" as well.

If you take a close look at the above list, you will notice that "openqaworker1" is mentioned twice but "openqaworker2" not at all. I have updated the description with the current state.
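For readers less familiar with sed, the substitution used in this comment can be sketched in Python. This is only an illustration of the text transformation: the sample input is hypothetical (the real layout of workerconf.sls may differ), and the names `power_status_commands` and `SAMPLE` are made up for this sketch.

```python
import re

# Same pattern as the sed expression in this comment, in Python regex syntax.
PATTERN = re.compile(r"^.*serial:.*`(.*) -H (\S*)(\.suse\.de .*) sol activate.*$")


def power_status_commands(text):
    """Return one "power status" command per matching "serial:" line."""
    commands = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:  # like "sed -n ... p": non-matching lines are dropped
            commands.append(match.expand(r"\1 -H \2\3 power status"))
    return commands


# Hypothetical sample input; host, user and password are placeholders.
SAMPLE = """\
openqaworker-example:
  1:
    serial: `ipmitool -I lanplus -H worker-example-ipmi.suse.de -U user -P secret sol activate`
  2:
    WORKER_CLASS: qemu_x86_64
"""

if __name__ == "__main__":
    for command in power_status_commands(SAMPLE):
        print(command)
```

Each returned string keeps the original IPMI connection options and only swaps the trailing "sol activate" for "power status", so it can be reviewed before being executed.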

#3 Updated by okurz 11 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

All machines except the already problematic ones "powerqaworker-qa" and "openqaworker-arm-3" are back up.

#4 Updated by okurz 11 months ago

  • Status changed from In Progress to Feedback

waiting for feedback on my MR and notes from others

#5 Updated by cdywan 11 months ago

okurz wrote:

  • openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

#6 Updated by cdywan 11 months ago

  • Description updated (diff)

#7 Updated by okurz 11 months ago

  • Related to action #78218: [openQA][worker] Almost all openQA workers become offline added

#8 Updated by okurz 11 months ago

cdywan wrote:

okurz wrote:

  • openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

No. This means that the firmware/BIOS/UEFI is configured to automatically power on the machine if power returns. Same as on any consumer PC.
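On machines with a BMC, one possible way to request this behaviour remotely is the IPMI chassis power restore policy. The ticket does not say how the setting was actually applied, so the following is only a sketch under assumptions: ipmitool is installed, the BMC honours the policy, and the host and credentials passed in are placeholders. The function names are made up for this example.

```python
import subprocess

ALLOWED_POLICIES = ("always-on", "previous", "always-off")


def power_restore_policy_command(host, user, password, policy="always-on"):
    """Build the ipmitool call that sets the chassis power restore policy."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"policy must be one of {ALLOWED_POLICIES}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "chassis", "policy", policy,
    ]


def apply_power_restore_policy(host, user, password, policy="always-on"):
    """Run the command; raises CalledProcessError if ipmitool reports failure."""
    command = power_restore_policy_command(host, user, password, policy)
    subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it possible to print and review the command for each machine first. Whether a given BMC or firmware actually respects the policy would still need to be verified per machine.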

#9 Updated by coolo 11 months ago

s390zp18 is still offline, so I disabled those parts of grenache for the weekend - https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/465690c0f597fb3007efc7dedc30f4134b975a75

#10 Updated by okurz 11 months ago

s390x hosts were re-enabled with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/281

malbec.arch and qa-power8-5 seem to be down. I expect that this is actually handled elsewhere but many people seem to be allergic to tickets ;)

#11 Updated by okurz 11 months ago

  • Tracker changed from action to coordination
  • Subject changed from 2020-11-18 nbg power outage aftermath to [epic] 2020-11-18 nbg power outage aftermath
  • Description updated (diff)
  • Status changed from Feedback to Blocked
  • Parent task set to #80142

turned into epic with subtasks

#12 Updated by okurz 7 months ago

  • Description updated (diff)

#80544 resolved

#13 Updated by okurz 7 months ago

  • Description updated (diff)

#14 Updated by okurz 6 months ago

  • Description updated (diff)

#15 Updated by okurz 5 months ago

  • Description updated (diff)

#16 Updated by okurz 5 months ago

  • Status changed from Blocked to Resolved

All done, I am happy with what we have now :)
