action #156064

closed

Handle planned datacenter shutdown PRG1 2024-02-28 size:M

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2024-02-26
Due date:
% Done: 0%
Estimated time:

Description

Motivation

https://suse.slack.com/archives/C02AET1AAAD/p1708950623186539

(Moroni Flores)
* From February 28, 2024, starting at 18:00 CET (Feb 28 10:00 MST; Feb 29 1:00 CST)
* To February 29, 8:00 CET (Feb 29 0:00 MST; Feb 29 15:00 CST)
:round_pushpin: Location: Prague office server room (PRG1)
:wrench: Maintenance: Power outage in the PRG1 server room
:rotating_light: Impact: ALL services in PRG1 will be affected

On February 28 evening there is planned a power outage in the Prague office building and surrounding neighbourhood. In preparation to this, all the machines in the Prague office server room (PRG1) must be shut down. The Infra team will start shutting down the local equipment from 18:00 CET.
We ask each of you who has a server in PRG1 to please shut them down before 18:00 CET.

The Infra team will restart the servers and services gradually from February 29, 8:00 CET. In case a machine or service is still missing from 9:00 CET please let us know via SD ticket.

Thanks,
SUSE IT Infra :suselogo:

Acceptance criteria

  • AC1: Servers maintained by QE Tools in PRG1 are operational again after the shutdown, as they were before
  • AC2: No unused machines are powered on and wasting power

Suggestions

  • Identify all impacted machines on https://racktables.nue.suse.com/
  • Inform users
  • Consider shutting down all machines maintained by QE Tools in a controlled manner before the shutdown
  • Await shutdown to be over
  • Bring up all production machines and ensure operation
  • Ensure that no unused machines are powered on and wasting power
Actions #1

Updated by okurz 2 months ago

  • Subject changed from Handle planned datacenter shutdown PRG1 2024-02-28 to Handle planned datacenter shutdown PRG1 2024-02-28 size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to nicksinger
Actions #3

Updated by okurz 2 months ago

From https://racktables.nue.suse.com/index.php?andor=and&cft%5B%5D=34&cfe=&page=rackspace&tab=default&submit.x=11&submit.y=22 in the bottom right select only "PRG Server Room 1" which should leave you with only a couple of rows. It seems only PRG-SRV1-A, PRG-SRV1-B, PRG-SRV1-C are relevant.
In the three mentioned rows I found 3x7 racks. I opened all in browser tabs and crosschecked all entries. I guess most are simply not relevant for us.
So only the following racks are remotely relevant to us:

Most "qam.suse.cz" machines likely don't have "QE LSG" and, I assume, are maintained by mpluskal, so I wouldn't care about those, but it's a good opportunity to crosscheck.

So my suggestion is:

  1. For the machines that we have direct IPMI control over – so the ones mentioned in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls – power them down and remove from production
  2. For all others, ping the "contact person" and tell them to handle it, as we won't
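
The split between the two groups could be sketched roughly like this. It is only an illustration under assumptions: `classify_hosts` is a made-up helper, the host list is assumed to be a plain text file with one hostname per line (for example copied from racktables), and a host is assumed to count as "under our IPMI control" if its short name shows up on a `## serial:` line in a local copy of workerconf.sls.

```shell
#!/bin/sh
# Hypothetical helper: label each host read from stdin as "ipmi" (we can
# power it off ourselves) or "contact" (ask the contact person).
# Assumption: a host is under our IPMI control if its short name appears
# as a whole word on a "## serial:" line of the given workerconf.sls copy.
classify_hosts() {
    conf="$1"
    while IFS= read -r host; do
        [ -n "$host" ] || continue          # skip empty lines
        short="${host%%.*}"                 # strip the domain part
        if grep '## serial:' "$conf" | grep -qwF -- "$short"; then
            printf 'ipmi %s\n' "$host"
        else
            printf 'contact %s\n' "$host"
        fi
    done
}
```

Usage would be something like `classify_hosts git/salt-pillars-openqa/openqa/workerconf.sls < hosts.txt`, where `hosts.txt` is the hypothetical host list. The result still needs a manual crosscheck, since the matching is purely name based.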
Actions #4

Updated by nicksinger 2 months ago

used

for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do ssh root@openqa.suse.de "salt-key -y -d $i"; done

to remove all machines according to https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production
and used

cat git/salt-pillars-openqa/openqa/workerconf.sls | grep "## serial:" | cut -d "\`" -f 2- | grep suse.cz | rev | cut -d "\`" -f 2- | rev | sed "s/sol activate/chassis power off/g"

to generate a command to power off all machines.
Also created a general silence in https://stats.openqa-monitor.qa.suse.de/alerting/silences for all hosts affected.
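
For reference, a rough sketch of what the pipeline above does, wrapped in a function so the same thing can generate "power on" commands later. The sample input line is an assumption about how the `## serial:` comments in workerconf.sls look (inferred from the pipeline, not copied from the file), and `gen_power_cmds` is a made-up name. It only prints the ipmitool commands and does not run them.

```shell
#!/bin/sh
# Sketch only: the sample line below is an assumed format for the
# "## serial:" comments in workerconf.sls, not copied from the real file.
gen_power_cmds() {
    action="$1"    # "off" before the outage, "on" afterwards
    # Keep the PRG serial lines, extract the text between the first and
    # last backtick, then swap the SOL command for a chassis power command.
    grep '## serial:' \
        | grep 'suse.cz' \
        | sed -n 's/^[^`]*`\(.*\)`[^`]*$/\1/p' \
        | sed "s/sol activate/chassis power $action/"
}

# Made-up host and credentials, for illustration only
printf '%s\n' '## serial: `ipmitool -I lanplus -H worker-example-ipmi.qa.suse.cz -U user -P pass sol activate`' \
    | gen_power_cmds off
```

With the real file this would be `gen_power_cmds off < git/salt-pillars-openqa/openqa/workerconf.sls`. Review the printed commands before pasting them into a shell.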

Actions #5

Updated by nicksinger about 2 months ago

  • Status changed from Workable to Feedback

The silence has run out by now. I was able to access our machines again and powered them all back on with a similar command as above. I also enabled the salt keys again. Checking for some feedback. I saw a couple of MM issues but I'm not sure whether these are related to the shutdown.
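
A minimal sketch of how the salt keys could be handled in both directions with one helper. The assumptions: `salt_key_cmds` is a made-up name, and the minions re-submit their keys after booting, so accepting them with `salt-key -a` is enough. The exact steps used here are not recorded in this ticket. The helper only prints the ssh commands.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one ssh command per host.
# Assumption: minions re-submit their keys after booting, so "salt-key -a"
# is sufficient to bring them back under salt control.
salt_key_cmds() {
    flag="$1"; shift        # "-d" to delete a key, "-a" to accept it
    for host in "$@"; do
        printf 'ssh root@openqa.suse.de "salt-key -y %s %s"\n' "$flag" "$host"
    done
}

# Print the commands to accept the keys of two of the PRG1 workers again
salt_key_cmds -a openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz
```

Piping the output into `sh` would actually run the commands. Printing first makes it easy to double-check the host list.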

Actions #6

Updated by nicksinger about 2 months ago

  • Status changed from Feedback to Resolved

Systems seem fine. Please reopen if you suspect problems due to the shutdown.
