action #132620

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #130955: [epic] Migration out of SUSE NUE1 - QE setup in NUE3

Move of selected LSG QE machines NUE1 to NUE3 size:M

Added by okurz 10 months ago. Updated 7 months ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date: 2023-07-12
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

NUE1 needs to be emptied. For some machines as part of the "disaster recovery" plans we opted to have them moved to "NUE3" aka. "Marienberg" or similar. Assuming nobody does the job for us we need to unrack and organize the move with Facilities and SUSE-IT.

Acceptance criteria

Suggestions

  • DONE Ask hreinecke and SUSE-IT and wengel and others how this move is organized
  • DONE As necessary, organize transport of equipment: Create a ticket via https://sd.suse.com for component "Facilities" asking them how and where to prepare machines for the move, and ask them to move the equipment to NUE3
  • DONE As necessary: Go to NUE1 Maxtorhof place beforehand and prepare the move, e.g. nothing connected anymore, put on pallet, labeled, packed into boxes, etc.
  • DONE Inform users about the pending move
  • Wait for the move to have happened
  • Ensure machines are usable from NUE3 as intended, i.e. in "cold storage"
  • Ensure our inventory management and documentation is up-to-date
  • Inform users after everything is done

Related issues 2 (0 open, 2 closed)

Copied from QA - action #132617: Move of selected LSG QE machines NUE1 to PRG2e size:M (Resolved, okurz)
Copied to QA - action #132623: Decommissioning of selected selected LSQ QE machines from NUE1-SRV2 (Resolved, okurz, 2023-07-12)

Actions #1

Updated by okurz 10 months ago

  • Copied from action #132617: Move of selected LSG QE machines NUE1 to PRG2e size:M added
Actions #2

Updated by okurz 10 months ago

  • Copied to action #132623: Decommissioning of selected selected LSQ QE machines from NUE1-SRV2 added
Actions #3

Updated by okurz 10 months ago

  • Assignee deleted (okurz)

Ready for work

Actions #4

Updated by livdywan 10 months ago

  • Status changed from New to Blocked
  • Assignee set to livdywan

As discussed, I'm blocking this ticket (on #132671) to clarify how everyone in the team can log in before we estimate it.

Actions #5

Updated by okurz 10 months ago

  • Status changed from Blocked to New
  • Assignee changed from livdywan to mgriessmeier
  • Target version changed from Ready to future

@mgriessmeier same as the other ticket you can help here

Actions #6

Updated by okurz 10 months ago

  • Tags set to infra

One important thing we should clarify: We assume that NUE3 will be available to us as just a standard online datacenter. If that is not the case, e.g. if it is planned to keep the complete datacenter dormant and in cold redundancy, then we should reconsider the plan at least for some machines.

Actions #7

Updated by okurz 9 months ago

It was clarified that NUE3 is actually planned to be "Cold redundancy only". Some are against that decision and ask for that to be reconsidered. I would just accept that as given and not question the decision assuming that the business has considered the pros and cons accordingly.
What does that mean for us regarding capacity, capability for growth to support testing of future products and disaster recovery?

  1. Capacity: We need to invest more effort and likely also incur cost for hot-redundancy of ppc64le+bare-metal+special-virtualization+aarch64. In addition, another free rack in NUE2 would be helpful, as we then plan to accommodate more hardware in NUE2 FC Basement as part of a hot-geo-redundant setup
  2. Growth: Based on 1. "Capacity" no further impact regarding the potential to grow to support testing of any future products
  3. Disaster recovery: The reaction times in case of disaster, from slow to fast, are:
     • "No redundancy": Need to buy new hardware and set it up. Expected resolution time: Weeks
     • "Cold redundancy": That is what we plan our mostly used and old hardware from NUE1-SRV1 for. In case of disaster some of those machines might not be directly usable but most should be fine. Expected resolution time: Days
     • "Hot redundancy": This is what we support and plan already anyway using labs but not for any public facing services. Expected resolution time: Hours
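
The ordering of the three redundancy levels can be written down as a small lookup, for example to sort recovery options by expected resolution time. This is only an illustrative sketch: the enum, the function name and the unit ranking are assumptions made for this example; only the three levels and their "weeks/days/hours" estimates come from the comment above.

```python
from enum import Enum


class Redundancy(Enum):
    """Redundancy levels with their expected resolution time (names are hypothetical)."""

    NONE = "weeks"  # buy new hardware and set it up
    COLD = "days"   # stored, unconnected hardware, e.g. in NUE3
    HOT = "hours"   # already running, e.g. in labs


# Rank of each time unit, from slowest to fastest reaction.
UNIT_RANK = {"weeks": 0, "days": 1, "hours": 2}


def slow_to_fast(levels):
    """Return the levels sorted from slowest to fastest expected resolution."""
    return sorted(levels, key=lambda level: UNIT_RANK[level.value])


ordered = slow_to_fast([Redundancy.HOT, Redundancy.NONE, Redundancy.COLD])
print([level.name for level in ordered])
```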
Actions #9

Updated by okurz 7 months ago

wengel created new Slack channel #dc-marienberg https://suse.slack.com/archives/C05UHQ49B7D

Upfront notice about move organisation

(Wolfgang Engel) Hi, I just created this Slack channel to coordinate and discuss the move from Maxtorhof to Marienberg so everyone involved is on the same page. Welcome everyone, I just wanted to take this opportunity to get the people together who are involved in the move to Marienberg DC or who are going to move servers there. The build-ops team already collected their machines that need to be moved to Marienberg and they are ready for shipment. @Oliver Kurz since you are also moving machines there, we might coordinate the move of all machines so they are moved in batches. Please let us know if you need any assistance or when you are about to switch off your machines in SRV1 and un-rack them.
(Oliver Kurz) Sounds good. Thanks a lot for organizing this. Currently we still have machines in production use in both SRV1+SRV2 however we can power them off anytime when a move is imminent. As nobody from our team "SUSE LSG QE Tools" has access to SRV1 the best "lazy" option for us would be if somebody else, e.g. "you guys" or Eng-Infra or build-ops, would unrack and move all 22 machines from https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-marienberg-dr (SRV1+SRV2). If that is not an option I am sure I can organize some helping hands to unrack and prepare machines from both locations at a time of your discretion when somebody at least opens the door to SRV1 and at best tells us where to store the machines like pallet or something (CC @Matthias Griessmeier @Nick Singer)

Actions #10

Updated by okurz 7 months ago

  • Tags changed from infra to infra, next-maxtorhof-visit
  • Status changed from New to In Progress
  • Assignee changed from mgriessmeier to okurz
  • Target version changed from future to Ready

Context: https://suse.slack.com/archives/C05UHQ49B7D/p1696418526499529 in #dc-marienberg

wengel mentioned volunteers that can help us to unrack machines but generally we should do it after getting access to NUE1-SRV1.

(Wolfgang Engel) Tomorrow the first batch of servers from buildops will be sent over to Marienberg. So we will organize a second batch.
(Oliver Kurz) ok. Should we try to prepare additional machines tomorrow morning to be ready at 10 or plan for the second batch, e.g. on 2023-10-09 0900L? In the end that would be 22 machines from https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-marienberg-dr but I don't care if they are moved in the 1st or a later batch
(Wolfgang Engel) I will be there at around 8am. I will move some "Paletten" to the old All-Hands area where we can store your servers for the move.
(Oliver Kurz) alright. I will plan to be there some time around 0800-0900. Where can I find you as I only have access to SRV2 and office area?

Actions #11

Updated by okurz 7 months ago

  • Subject changed from Move of selected LSQ QE machines NUE1-SRV2 to NUE3 to Move of selected LSQ QE machines NUE1 to NUE3
Actions #12

Updated by okurz 7 months ago

wengel helped to unplug multiple machines already in preparation for the unracking. I removed openqaworker-arm-2 and openqaworker-arm-2 from salt as well now and applied a salt high state.

Actions #13

Updated by openqa_review 7 months ago

  • Due date set to 2023-10-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by okurz 7 months ago

All machines from https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-marienberg-dr were put on pallets and prepared for the move. The logistics company is picking up the machines. Likely all can be shipped in the first batch, so it was good that we unracked all machines now.

I created virtual racks in both netbox and racktables https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-marienberg-dr and moved machines to "Transit > NUE1-to-NUE3"

There was an inconsistency regarding nessberry. https://netbox.suse.de/dcim/devices/9572/ states that an LPAR should go to PRG2e but the chassis to NUE3, which does not make sense. https://mysuse.sharepoint.com/sites/DatacentreTransformation/_layouts/15/Doc.aspx?sourcedoc={4B68F941-C7EC-431A-B3CF-875DBBCC6C83}&file=2023-01-16_2023Q1_PowerPC_capacity_planning_-_WIP_v0.7.0_jf.xlsx&action=default&mobileredirect=true&DefaultItemOpen=1 states NUE3. I chose PRG2e now and put the machine on the corresponding pile.

Actions #15

Updated by okurz 7 months ago

  • Subject changed from Move of selected LSQ QE machines NUE1 to NUE3 to Move of selected LSG QE machines NUE1 to NUE3
Actions #16

Updated by okurz 7 months ago

  • Subject changed from Move of selected LSG QE machines NUE1 to NUE3 to Move of selected LSG QE machines NUE1 to NUE3 size:M
  • Description updated (diff)
Actions #17

Updated by okurz 7 months ago

  • Due date changed from 2023-10-19 to 2023-10-26

We have overlooked malbec.arch.suse.de which was listed on https://netbox.suse.de/dcim/devices/?tag=move-to-marienberg-dr but did not have the "QE LSG" tag. Added that. wengel in https://suse.slack.com/archives/C05UHQ49B7D/p1696515055956369 stated

So I will take care that those machines will be moved to NUE3 as well, thanks for noticing!
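
A gap like this can be spotted by comparing tag sets: any device carrying "move-to-marienberg-dr" but not "qe-lsg" is missing from the combined query and is a candidate for review. The following is a minimal sketch on hand-written sample data. It deliberately does not call the NetBox API; the function names and all device records except malbec.arch.suse.de are hypothetical. It also shows how repeating the `tag` parameter yields the combined filter URL used in this ticket.

```python
from urllib.parse import urlencode

BASE_URL = "https://netbox.suse.de/dcim/devices/"


def filter_url(tags):
    """Build a device list URL with one repeated ``tag`` parameter per tag."""
    return BASE_URL + "?" + urlencode([("tag", tag) for tag in tags])


def missing_tag(devices, required, expected):
    """Return names of devices that carry ``required`` but lack ``expected``."""
    return sorted(
        name
        for name, tags in devices.items()
        if required in tags and expected not in tags
    )


# Sample data: only malbec.arch.suse.de reflects the situation in this ticket,
# the other entries are made up for illustration.
devices = {
    "malbec.arch.suse.de": {"move-to-marienberg-dr"},
    "example-worker-1": {"move-to-marienberg-dr", "qe-lsg"},
    "example-worker-2": {"qe-lsg"},
}

print(filter_url(["qe-lsg", "move-to-marienberg-dr"]))
print(missing_tag(devices, "move-to-marienberg-dr", "qe-lsg"))
```

Not every device found this way necessarily belongs to QE LSG, so the result is a list to check by hand rather than a list to tag automatically.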

Regarding the openQA control instances moving all from powerqaworker-qam-1:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/639

Actions #18

Updated by okurz 7 months ago

  • Status changed from In Progress to Feedback

merged. svirt-hyperv2016 verification job: https://openqa.suse.de/tests/12391811#

Actions #19

Updated by okurz 7 months ago

  • Tags changed from infra, next-maxtorhof-visit to infra
Actions #20

Updated by okurz 7 months ago

okurz wrote in #note-17:

We have overlooked malbec.arch.suse.de which was listed on https://netbox.suse.de/dcim/devices/?tag=move-to-marienberg-dr but did not have the "QE LSG" tag. Added that. wengel in https://suse.slack.com/archives/C05UHQ49B7D/p1696515055956369 stated

So I will take care that those machines will be moved to NUE3 as well, thanks for noticing!

No response from wengel yet. https://netbox.suse.de/dcim/devices/3603/ still lists NUE1-SRV2. Still waiting for any feedback there.

Actions #21

Updated by okurz 7 months ago

  • Due date changed from 2023-10-26 to 2023-11-02
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next
Actions #22

Updated by okurz 7 months ago

  • Due date deleted (2023-11-02)
  • Status changed from Feedback to Resolved
  • Target version changed from Tools - Next to Ready

https://netbox.suse.de/dcim/devices/3603/ malbec.arch.suse.de now lists Marienberg and is confirmed to be targeted for NUE3 as per https://suse.slack.com/archives/C05UHQ49B7D/p1698320881517679

(Wolfgang Engel) @Héctor Orón @Oliver Kurz I unracked malbec today and it is ready for moving to NUE3. I will let you know once we have a moving date. (…) racktables and netbox updated for malbec and systemdqa

https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-marienberg-dr looks good now. There might be a mismatch about openqaworker19, which is on that list but listed with location PRG2, but at this point I think it's more cost-efficient to be flexible regardless of where the machine pops up.

Given that all referenced machines are intended for disaster recovery cold storage and that there are no further plans right now about racking or connecting the machines in NUE3, I will call the story here resolved and only react to requests, if at all, in the future.
