action #156064
Handle planned datacenter shutdown PRG1 2024-02-28 size:M
Status: closed
Description
Motivation
https://suse.slack.com/archives/C02AET1AAAD/p1708950623186539
(Moroni Flores)
* From February 28, 2024, starting at 18:00 CET (Feb 28 10:00 MST; Feb 29 1:00 CST)
* To February 29, 8:00 CET (Feb 29 0:00 MST; Feb 29 15:00 CST)
:round_pushpin: Location: Prague office server room (PRG1)
:wrench: Maintenance: Power outage in the PRG1 server room
:rotating_light: Impact: ALL services in PRG1 will be affected
On the evening of February 28 a power outage is planned in the Prague office building and the surrounding neighbourhood. In preparation for this, all machines in the Prague office server room (PRG1) must be shut down. The Infra team will start shutting down the local equipment at 18:00 CET.
We ask everyone who has a server in PRG1 to please shut it down before 18:00 CET. The Infra team will restart the servers and services gradually from February 29, 8:00 CET. If a machine or service is still missing after 9:00 CET, please let us know via an SD ticket.
Thanks,
SUSE IT Infra :suselogo:
Acceptance criteria
- AC1: Servers maintained by QE Tools in PRG1 are operational again after the shutdown, as they were before
- AC2: No unused machines are powered on and wasting power
Suggestions
- Identify all impacted machines on https://racktables.nue.suse.com/
- Inform users
- Consider shutting down all QE-Tools-maintained machines in a controlled manner before the outage (see the sketch after this list)
- Await shutdown to be over
- Bring up all production machines and ensure operation
- Ensure that no unused machines are powered on and wasting power
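A minimal sketch of such a controlled shutdown, assuming the affected QE Tools machines are the salt-managed workers listed in the comments below and that they are reachable over SSH (the procedure that was actually followed is described in the comments):
# Cleanly shut down each worker over SSH before the outage; the SSH
# connection may drop with an error as the host powers off.
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do
    ssh root@"$i" poweroff
done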
Updated by okurz 8 months ago
On https://racktables.nue.suse.com/index.php?andor=and&cft%5B%5D=34&cfe=&page=rackspace&tab=default&submit.x=11&submit.y=22, select only "PRG Server Room 1" in the bottom right, which should leave you with only a couple of rows. It seems only PRG-SRV1-A, PRG-SRV1-B and PRG-SRV1-C are relevant.
In the three mentioned rows I found 3x7 racks. I opened all of them in browser tabs and crosschecked all entries. I guess most are simply not relevant for us.
So only the following racks are remotely relevant to us:
Most "qam.suse.cz" machines likely don't have "QE LSG" and are maintained by mpluskal I assume so I wouldn't care about those but it's a good opportunity to crosscheck.
So my suggestion is:
- For the machines that we have direct IPMI control over – so the ones mentioned in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls – power them down and remove them from production (see the example after this list)
- For all others, ping the "contact person" and tell them to handle it themselves, as we won't
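For illustration, powering one machine down over IPMI would look roughly like this; the BMC host, user and password below are placeholders, the real values are the ones documented in workerconf.sls:
# hypothetical placeholder values; the real BMC host/credentials come from workerconf.sls
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
# verify the result
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power status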
Updated by nicksinger 7 months ago
used
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do ssh root@openqa.suse.de "salt-key -y -d $i"; done
to remove all machines according to https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production
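As a quick cross-check (an assumption on my side, not part of the wiki procedure), the removed minions should no longer show up when listing the keys on the salt master:
ssh root@openqa.suse.de "salt-key -L" | grep suse.cz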
and used
cat git/salt-pillars-openqa/openqa/workerconf.sls | grep "## serial:" | cut -d "\`" -f 2- | grep suse.cz | rev | cut -d "\`" -f 2- | rev | sed "s/sol activate/chassis power off/g"
to generate the commands to power off all machines.
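For illustration of what that pipeline produces (the BMC host and credentials below are placeholders, not values from the actual file): a workerconf.sls comment like
## serial: `ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate`
is turned into
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
so the output is one power-off command per machine, which can be reviewed and then executed, for example by piping the generated lines to a shell.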
Also created a general silence in https://stats.openqa-monitor.qa.suse.de/alerting/silences for all hosts affected.
Updated by nicksinger 7 months ago
- Status changed from Workable to Feedback
The silence has run out by now; I was able to access our machines again and powered them all back on with a command similar to the one above (see the sketch below). Also enabled the salt keys again. Now checking for feedback. I saw a couple of MM (multi-machine) issues but I'm not sure whether these are related to the shutdown.
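A sketch of the power-on counterpart, assuming the same host list and the same workerconf.sls extraction as above (not necessarily the exact commands that were run):
# re-accept the salt minion keys that were deleted before the shutdown
for i in openqaworker14.qa.suse.cz openqaworker15.qa.suse.cz openqaworker16.qa.suse.cz openqaworker17.qa.suse.cz openqaworker18.qa.suse.cz qesapworker-prg4.qa.suse.cz qesapworker-prg5.qa.suse.cz qesapworker-prg6.qa.suse.cz qesapworker-prg7.qa.suse.cz; do ssh root@openqa.suse.de "salt-key -y -a $i"; done
# generate one IPMI power-on command per machine from workerconf.sls
cat git/salt-pillars-openqa/openqa/workerconf.sls | grep "## serial:" | cut -d "\`" -f 2- | grep suse.cz | rev | cut -d "\`" -f 2- | rev | sed "s/sol activate/chassis power on/g"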
Updated by nicksinger 7 months ago
- Status changed from Feedback to Resolved
Systems seem fine. Please reopen if you suspect problems due to the shutdown.