action #80540

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #78206: [epic] 2020-11-18 nbg power outage aftermath

idea: Conduct "power outage drills", e.g. once every half-year?

Added by okurz 8 months ago. Updated about 2 months ago.

Status: Rejected
Priority: Low
Assignee:
Target version:
Start date: 2020-11-27
Due date:
% Done: 0%
Estimated time:

Description

Motivation

During the last "power outage" in the Nbg data center we fared reasonably well, but some machines and services did not come up by themselves due to simple things, e.g. not being configured to automatically power on after power is restored. See #78206 for details.

Acceptance criteria

  • AC1: We have a schedule within the team SUSE QE Tools for when we forcefully shut down all or selected systems to test our recovery plans
  • AC2: A simple requirements catalog for our services lists what we need to ensure in service designs

Suggestion

  • Decide on a useful cadence, date and procedure
  • Add a corresponding calendar appointment with a reminder

History

#1 Updated by okurz about 2 months ago

  • Status changed from Workable to Rejected
  • Assignee set to okurz

We often reboot our machines, which also tests this, and we managed to configure all openQA worker hosts to automatically power up after power is back. Because we also do not have direct physical control over all machines, I think we cannot reasonably do more here, hence rejecting.
