Project

General

Profile

Actions

coordination #78206

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] 2020-11-18 nbg power outage aftermath

Added by okurz over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2020-11-27
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Observation

In 2020-11-18 a power outage happened in Nbg which caused the network and many services to shut down. o3 started just fine automatically. qsf-cluster.qa.suse.de shows only "first-test-vm", not more. openqa-monitor and backup are down as well as schort-server, changed to start automatically. openqaworker1 was power off, restarted over ipmi, might have been a deliberate action by IT team? Further things that we can improve?

Problems

Suggestions

  • Conduct "fire drills", at least reboot of qsf-cluster and other machines more often. I think this supports the idea from okurz which we implemented to have automatic reboot of machines as we within the OSD infrastructure -> #80540

Subtasks 4 (0 open4 closed)

action #80540: idea: Conduct "power outage drills", e.g. once every half-year?Rejectedokurz2020-11-27

Actions
action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others)Resolvedokurz2020-11-27

Actions
action #80544: Ensure that IPMI for powerqaworker-qam works reliablyResolvedokurz

Actions
action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outageResolvedokurz2021-05-05

Actions

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offlineResolvedokurz2020-11-19

Actions
Actions

Also available in: Atom PDF