coordination #78206

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] 2020-11-18 nbg power outage aftermath

Added by okurz over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-11-27
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

On 2020-11-18 a power outage happened in Nbg which caused the network and many services to shut down. o3 started just fine automatically. qsf-cluster.qa.suse.de shows only "first-test-vm", nothing more. openqa-monitor and backup were down, as well as schort-server; they have since been changed to start automatically. openqaworker1 was powered off and was restarted over IPMI; might that have been a deliberate action by the IT team? Are there further things that we can improve?

Problems

Suggestions

  • Conduct "fire drills", or at least reboot qsf-cluster and other machines more often. I think this supports okurz's idea, which we implemented, of rebooting machines automatically as we do within the OSD infrastructure -> #80540

Subtasks 4 (0 open, 4 closed)

action #80540: idea: Conduct "power outage drills", e.g. once every half-year? | Rejected | okurz | 2020-11-27

action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) | Resolved | okurz | 2020-11-27

action #80544: Ensure that IPMI for powerqaworker-qam works reliably | Resolved | okurz

action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage | Resolved | okurz | 2021-05-05


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline | Resolved | okurz | 2020-11-19

Actions #1

Updated by okurz over 3 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz

waiting for more information, ideas, problems, feedback from others.

Actions #2

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from Feedback to In Progress
  • Priority changed from Normal to Urgent

To check and correct the power status of all OSD machines I generated the IPMI commands with

sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p' <(curl -s https://gitlab.suse.de/openqa/salt-pillars-openqa/-/raw/master/openqa/workerconf.sls)

and executed the resulting commands.
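A minimal sketch for trying the sed expression above offline, without fetching the real pillar data. The sample line, the host name and the credentials are made up, and the exact layout of workerconf.sls is an assumption; note that \S requires GNU sed.

```shell
# Hypothetical workerconf.sls line; the real file layout is an assumption.
sample='    # serial: `ipmitool -I lanplus -H example-worker-ipmi.suse.de -U ADMIN -P SECRET sol activate`'

# Rewrite each "... sol activate" console command into a "... power status" query.
# Lines without a matching serial entry are dropped (-n plus the p flag).
to_power_status() {
  sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p'
}

printf '%s\n' "$sample" | to_power_status
```

Piping the real file through to_power_status instead of the sample should give one command per worker that has a serial entry of this shape.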

So some machines are not online, and "qa-power8-5.qa.suse.de", "fsp1-powerqaworker-qam.qa.suse.de" and "openqaworker-arm-3-ipmi.suse.de" (already known) could not be reached at all. I triggered the "power on" command for all of them. This surprisingly worked for "qa-power8-5.qa.suse.de" as well.
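The check-then-power-on step could be scripted roughly as follows. This is a sketch, not the commands that were actually run: the function name, the IPMITOOL override and the lanplus interface are assumptions, while "power status" and "power on" are the ipmitool shortcuts for the chassis power commands.

```shell
# IPMITOOL can be overridden (for example with a stub) for dry runs.
IPMITOOL=${IPMITOOL:-ipmitool}

# Power on one machine over IPMI, but only if it reports being off.
# Arguments: BMC host, user, password (placeholders, not real values).
power_on_if_off() {
  host=$1 user=$2 pass=$3
  status=$("$IPMITOOL" -I lanplus -H "$host" -U "$user" -P "$pass" power status 2>/dev/null) || {
    echo "$host: unreachable"
    return 1
  }
  case $status in
    *"is off"*)
      "$IPMITOOL" -I lanplus -H "$host" -U "$user" -P "$pass" power on >/dev/null &&
        echo "$host: was off, power on requested"
      ;;
    *)
      # Already on (or an unexpected answer): report it and change nothing.
      echo "$host: $status"
      ;;
  esac
}
```

A call would look like power_on_if_off some-worker-ipmi.example.com ADMIN SECRET, where all three values are hypothetical.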

If you take a close look in the above list you realize that "openqaworker1" is mentioned twice but "openqaworker2" not at all. I have updated the description with the current state
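A duplicate entry like the one noted above can also be spotted mechanically. The following sketch assumes the generated commands carry the BMC host as the argument of -H, as in the sed expression; the function name is hypothetical.

```shell
# Print every host (the -H argument) that occurs more than once on stdin.
duplicate_hosts() {
  sed -n 's/^.* -H \([^ ]*\) .*$/\1/p' | sort | uniq -d
}
```

Feeding the list of generated "power status" commands into duplicate_hosts prints nothing when every host is listed exactly once.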

Actions #3

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Priority changed from Urgent to High

all machines except the already problematic ones "powerqaworker-qa" and "openqaworker-arm-3" are back up.

Actions #4

Updated by okurz over 3 years ago

  • Status changed from In Progress to Feedback

waiting for feedback on my MR and notes from others

Actions #5

Updated by livdywan over 3 years ago

okurz wrote:

  • openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

Actions #6

Updated by livdywan over 3 years ago

  • Description updated (diff)
Actions #7

Updated by okurz over 3 years ago

  • Related to action #78218: [openQA][worker] Almost all openQA workers become offline added
Actions #8

Updated by okurz over 3 years ago

cdywan wrote:

okurz wrote:

  • openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

No. This means that the firmware/BIOS/UEFI is configured to automatically power on the machine if power returns. Same as on any consumer PC.
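On machines with a BMC this setting is often also reachable over IPMI as the chassis "power restore policy". The sketch below reads the policy from "ipmitool chassis status" and switches it with "ipmitool chassis policy always-on". Whether the affected workers expose the setting this way is an assumption (some firmware only offers it in the setup menu), and the function name, the IPMITOOL override and the credentials are hypothetical.

```shell
# IPMITOOL can be overridden (for example with a stub) for dry runs.
IPMITOOL=${IPMITOOL:-ipmitool}

# Make sure the BMC powers the machine on again when mains power returns.
# Arguments: BMC host, user, password (placeholders, not real values).
ensure_always_on() {
  host=$1 user=$2 pass=$3
  # Extract the value of the "Power Restore Policy" line from the chassis status.
  policy=$("$IPMITOOL" -I lanplus -H "$host" -U "$user" -P "$pass" chassis status |
    sed -n 's/^Power Restore Policy *: *//p')
  if [ "$policy" = "always-on" ]; then
    echo "$host: already always-on"
  else
    "$IPMITOOL" -I lanplus -H "$host" -U "$user" -P "$pass" chassis policy always-on >/dev/null &&
      echo "$host: policy changed from '${policy:-unknown}' to always-on"
  fi
}
```

Checking the current policy first keeps the function safe to rerun across all machines, since hosts that are already configured are only reported.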

Actions #9

Updated by coolo over 3 years ago

s390zp18 is still offline, so I disabled those parts of grenache for the weekend - https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/465690c0f597fb3007efc7dedc30f4134b975a75

Actions #10

Updated by okurz over 3 years ago

s390x hosts were re-enabled with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/281

malbec.arch and qa-power8-5 seem to be down. I expect that this is actually handled elsewhere but many people seem to be allergic to tickets ;)

Actions #11

Updated by okurz over 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from 2020-11-18 nbg power outage aftermath to [epic] 2020-11-18 nbg power outage aftermath
  • Description updated (diff)
  • Status changed from Feedback to Blocked
  • Parent task set to #80142

turned into epic with subtasks

Actions #12

Updated by okurz about 3 years ago

  • Description updated (diff)

#80544 resolved

Actions #13

Updated by okurz about 3 years ago

  • Description updated (diff)
Actions #14

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #15

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #16

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

All done, I am happy with what we have now :)
