action #156064
Handle planned datacenter shutdown PRG1 2024-02-28 size:M
Status: closed
Description
Motivation
https://suse.slack.com/archives/C02AET1AAAD/p1708950623186539
(Moroni Flores)
* From February 28, 2024, starting at 18:00 CET (Feb 28 10:00 MST; Feb 29 1:00 CST)
* To February 29, 8:00 CET (Feb 29 0:00 MST; Feb 29 15:00 CST)
:round_pushpin: Location: Prague office server room (PRG1)
:wrench: Maintenance: Power outage in the PRG1 server room
:rotating_light: Impact: ALL services in PRG1 will be affected
On the evening of February 28 a power outage is planned in the Prague office building and the surrounding neighbourhood. In preparation for this, all machines in the Prague office server room (PRG1) must be shut down. The Infra team will start shutting down the local equipment at 18:00 CET.
We ask everyone who has a server in PRG1 to please shut it down before 18:00 CET. The Infra team will restart the servers and services gradually from February 29, 8:00 CET. If a machine or service is still missing after 9:00 CET, please let us know via an SD ticket.
Thanks,
SUSE IT Infra :suselogo:
Acceptance criteria
- AC1: Servers maintained by QE Tools in PRG1 are operational again after the shutdown, as they were before
- AC2: No unused machines are powered on and wasting power
Suggestions
- Identify all impacted machines on https://racktables.nue.suse.com/
- Inform users
- Consider shutting down all QE-Tools-maintained machines in a controlled manner before the outage (see the sketch after this list)
- Await shutdown to be over
- Bring up all production machines and ensure operation
- Ensure that no unused machines are powered on and wasting power
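A minimal sketch of such a controlled shutdown, assuming the affected QE Tools machines are the salt-managed workers listed in the comments below and that they are reachable over SSH (the procedure that was actually followed is described in the comments):
# Cleanly shut down each worker over SSH before the outage; the SSH
# connection may drop with an error as the host powers off.
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do
    ssh root@"$i" poweroff
done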
Updated by okurz 8 months ago
On https://racktables.nue.suse.com/index.php?andor=and&cft%5B%5D=34&cfe=&page=rackspace&tab=default&submit.x=11&submit.y=22, select only "PRG Server Room 1" in the bottom right, which should leave you with only a couple of rows. It seems only PRG-SRV1-A, PRG-SRV1-B and PRG-SRV1-C are relevant.
In the three mentioned rows I found 3x7 racks. I opened all of them in browser tabs and crosschecked all entries. I guess most are simply not relevant for us.
So only the following racks are remotely relevant to us:
Most "qam.suse.cz" machines likely don't have "QE LSG" and are maintained by mpluskal I assume so I wouldn't care about those but it's a good opportunity to crosscheck.
So my suggestion is:
- For the machines that we have direct IPMI control over – so the ones mentioned in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls – power them down and remove them from production (see the example after this list)
- For all others, ping the "contact person" and tell them to handle it themselves, as we won't
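For illustration, powering one machine down over IPMI would look roughly like this; the BMC host, user and password below are placeholders, the real values are the ones documented in workerconf.sls:
# hypothetical placeholder values; the real BMC host/credentials come from workerconf.sls
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
# verify the result
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power status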
Updated by nicksinger 7 months ago
used
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do ssh root@openqa.suse.de "salt-key -y -d $i"; done
to remove all machines according to https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production
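As a quick cross-check (an assumption on my side, not part of the wiki procedure), the removed minions should no longer show up when listing the keys on the salt master:
ssh root@openqa.suse.de "salt-key -L" | grep suse.cz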
and used
cat git/salt-pillars-openqa/openqa/workerconf.sls | grep "## serial:" | cut -d "\`" -f 2- | grep suse.cz | rev | cut -d "\`" -f 2- | rev | sed "s/sol activate/chassis power off/g"
to generate the commands to power off all machines.
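For illustration of what that pipeline produces (the BMC host and credentials below are placeholders, not values from the actual file): a workerconf.sls comment like
## serial: `ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate`
is turned into
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
so the output is one power-off command per machine, which can be reviewed and then executed, for example by piping the generated lines to a shell.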
Also created a general silence in https://stats.openqa-monitor.qa.suse.de/alerting/silences for all hosts affected.
Updated by nicksinger 7 months ago
- Status changed from Workable to Feedback
The silence has run out by now; I was able to access our machines again and powered them all back on with a command similar to the one above (see the sketch below). Also enabled the salt keys again. Now checking for feedback. I saw a couple of MM (multi-machine) issues but I'm not sure whether these are related to the shutdown.
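A sketch of the power-on counterpart, assuming the same host list and the same workerconf.sls extraction as above (not necessarily the exact commands that were run):
# re-accept the salt minion keys that were deleted before the shutdown
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do ssh root@openqa.suse.de "salt-key -y -a $i"; done
# generate one IPMI power-on command per machine from workerconf.sls
cat git/salt-pillars-openqa/openqa/workerconf.sls | grep "## serial:" | cut -d "\`" -f 2- | grep suse.cz | rev | cut -d "\`" -f 2- | rev | sed "s/sol activate/chassis power on/g"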
Updated by nicksinger 7 months ago
- Status changed from Feedback to Resolved
Systems seem fine. Please reopen if you suspect problems due to the shutdown.