coordination #78206: [epic] 2020-11-18 nbg power outage aftermath - openQA Infrastructure (public) - openSUSE Project Management Tool

closed

openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] 2020-11-18 nbg power outage aftermath

Added by okurz over 4 years ago. Updated almost 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

-

Target version:

openQA Project (public) - Ready

Start date:

2020-11-27

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

Observation¶

On 2020-11-18 a power outage in Nbg caused the network and many services to shut down. o3 started just fine automatically. qsf-cluster.qa.suse.de shows only "first-test-vm", nothing more. openqa-monitor and backup were down, as was schort-server; these were changed to start automatically. openqaworker1 was powered off and was restarted over IPMI; this might have been a deliberate action by the IT team. What further things can we improve?

Problems¶

DONE qsf-cluster.qa.suse.de shows only "first-test-vm", not more
DONE openqa-monitor and backup were down as well as schort-server -> changed to start automatically
DONE wrong IPMI configuration for openqaworker2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/275
DONE IPMI information for grenache-1 missing completely in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/276
DONE openqaworker1 was "power off", restarted over IPMI, needs automatic power-on enabled -> #80542
DONE powerqaworker-qam-1 not reachable via IPMI -> #80544
DONE qa-power8-5 ipmi unstable -> #80544
DONE openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam, grenache.qa as well as grenache-1.qa were "power off", restarted over IPMI, need automatic power-on enabled -> #92185

Suggestions¶

Conduct "fire drills", at least rebooting qsf-cluster and other machines more often. I think this supports the idea from okurz, which we implemented, to have automatic reboots of machines as we do within the OSD infrastructure -> #80540

Subtasks 4 (0 open — 4 closed)

Related issues 1 (0 open — 1 closed)

#1

Updated by okurz over 4 years ago

Status changed from New to Feedback
Assignee set to okurz

waiting for more information, ideas, problems, feedback from others.

#2

Updated by okurz over 4 years ago

Description updated (diff)
Status changed from Feedback to In Progress
Priority changed from Normal to Urgent

To check and correct the power status of all OSD machines I ran the following and then executed the resulting commands:

sed -n 's/^.*serial:.*`\(.*\) -H \(\S*\)\(\.suse\.de .*\) sol activate.*$/\1 -H \2\3 power status/p' <(curl -s https://gitlab.suse.de/openqa/salt-pillars-openqa/-/raw/master/openqa/workerconf.sls)

So some machines are not online, and "qa-power8-5.qa.suse.de", "fsp1-powerqaworker-qam.qa.suse.de" and "openqaworker-arm-3-ipmi.suse.de" (already known) could not be reached at all. I triggered the "power on" command for all of them. Surprisingly, this worked for "qa-power8-5.qa.suse.de" as well.

If you take a close look at the above list you will notice that "openqaworker1" is mentioned twice but "openqaworker2" not at all. I have updated the description with the current state.
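
The sed expression above can be mirrored in Python, which also makes it easy to flag an IPMI host that appears more than once (the "openqaworker1 twice, openqaworker2 never" symptom). This is only a sketch: the regular expression is copied from the sed command, but the `SAMPLE` text, its host names, credentials and the `-I lanplus` option are made-up stand-ins for the real workerconf.sls, and the function names are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the sed expression, written in Python regex syntax.
SERIAL_RE = re.compile(r"^.*serial:.*`(.*) -H (\S*)(\.suse\.de .*) sol activate.*$")

# Hypothetical stand-in for workerconf.sls; hosts and credentials are invented.
# worker-b deliberately reuses worker-a's IPMI host to simulate a wrong entry.
SAMPLE = """\
worker-a:
  serial: `ipmitool -I lanplus -H worker-a-ipmi.suse.de -U demo -P demo sol activate`
worker-b:
  serial: `ipmitool -I lanplus -H worker-a-ipmi.suse.de -U demo -P demo sol activate`
worker-c:
  serial: `ipmitool -I lanplus -H worker-c-ipmi.suse.de -U demo -P demo sol activate`
"""


def power_status_commands(text):
    """Return (ipmi_host, command) pairs, one per matching serial line."""
    pairs = []
    for line in text.splitlines():
        match = SERIAL_RE.match(line)
        if match:
            tool, host, rest = match.groups()
            # Swap "sol activate" for "power status", as the sed replacement does.
            pairs.append((host + ".suse.de", f"{tool} -H {host}{rest} power status"))
    return pairs


def duplicated_hosts(pairs):
    """Return IPMI hosts that are referenced by more than one entry."""
    counts = Counter(host for host, _ in pairs)
    return sorted(host for host, count in counts.items() if count > 1)


pairs = power_status_commands(SAMPLE)
for _, command in pairs:
    print(command)
print("duplicated IPMI hosts:", duplicated_hosts(pairs))
```

The commands are only printed here, not executed; a duplicated host in the output hints at a copy-and-paste error in the worker configuration.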

#3

Updated by okurz over 4 years ago

Description updated (diff)
Priority changed from Urgent to High

All machines except the already problematic ones "powerqaworker-qam" and "openqaworker-arm-3" are back up.

#4

Updated by okurz over 4 years ago

Status changed from In Progress to Feedback

waiting for feedback on my MR and notes from others

#5

Updated by livdywan over 4 years ago

okurz wrote:

openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

#6

Updated by livdywan over 4 years ago

Description updated (diff)

#7

Updated by okurz over 4 years ago

Related to action #78218: [openQA][worker] Almost all openQA workers become offline added

#8

Updated by okurz over 4 years ago

cdywan wrote:

okurz wrote:

openqaworker1, openqaworker-power8, malbec.arch, qa-power8-4, qa-power8-5, powerqaworker-qam were "power off", restarted over ipmi, needs automatic on enabled

What does this mean? Automatic restarts? Via GitLab CI? Something else?

No. This means that the firmware/BIOS/UEFI is configured to automatically power on the machine if power returns. Same as on any consumer PC.
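
On machines whose BMC supports it, this power-restore behaviour can often be set remotely with `ipmitool chassis policy` instead of going through the firmware setup screen. The ticket does not say which method was used, and not every BMC honours this setting, so the following is only a sketch under that assumption. The function name, host name and credentials are hypothetical, and the command is only built, not executed.

```python
import shlex

# Policies accepted by "ipmitool chassis policy" (besides "list").
ALLOWED_POLICIES = {"always-on", "previous", "always-off"}


def power_restore_command(ipmi_host, user, password, policy="always-on"):
    """Build the ipmitool argv that sets the chassis power-restore policy."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported policy: {policy}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", ipmi_host, "-U", user, "-P", password,
        "chassis", "policy", policy,
    ]


# Hypothetical host and credentials, for illustration only.
argv = power_restore_command("worker-a-ipmi.example.org", "demo", "demo")
print(shlex.join(argv))
```

With `always-on` the machine powers up whenever mains power returns; `previous` restores whatever state it was in before the outage.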

#9

Updated by coolo over 4 years ago

s390zp18 is still offline, so I disabled those parts of grenache for the weekend - https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/465690c0f597fb3007efc7dedc30f4134b975a75

#10

Updated by okurz over 4 years ago

s390x hosts were re-enabled with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/281

malbec.arch and qa-power8-5 seem to be down. I expect that this is actually handled elsewhere but many people seem to be allergic to tickets ;)

#11

Updated by okurz over 4 years ago

Tracker changed from action to coordination
Subject changed from 2020-11-18 nbg power outage aftermath to [epic] 2020-11-18 nbg power outage aftermath
Description updated (diff)
Status changed from Feedback to Blocked
Parent task set to #80142

turned into epic with subtasks

#12

Updated by okurz about 4 years ago

Description updated (diff)

#80544 resolved

#13

Updated by okurz almost 4 years ago

Description updated (diff)

#14

Updated by okurz almost 4 years ago

Description updated (diff)

#15

Updated by okurz almost 4 years ago

Description updated (diff)

#16

Updated by okurz almost 4 years ago

Status changed from Blocked to Resolved

All done, I am happy with what we have now :)
