action #102650

closed

Organize labs move to new building and SRV2 size:M

Added by nicksinger over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2021-11-18
Due date: 2022-05-27
% Done: 0%
Estimated time:

Description

It is planned to move the labs to our new office space. Planning for this is currently ongoing and driven by external partners. From within QA this is driven by Ralf, Matthias and me. There were several planning meetings in the past and several Excel sheets sent over mail, so there is not much I can put here. From what I can say at the moment, we plan to decommission a lot of old hardware, move production stuff into SRV2@Maxtorhof and move the rest into the new lab.

If you are from QA, have hardware in the labs and are not involved yet, feel free to add comments and your requirements here.

Acceptance criteria

  • AC1: QA labs are not present in racktables anymore

Related issues 4 (1 open, 3 closed)

Related to openQA Project - action #106056: [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M - Workable - 2022-02-07

Related to openQA Infrastructure - action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M - Resolved - okurz - 2022-02-23

Related to openQA Infrastructure - action #107257: [alert][osd] Apache Response Time alert size:M - Resolved - okurz - 2022-02-22

Copied to openQA Infrastructure - action #109746: Improve QA related server room management, consistent naming and tagging size:M - Resolved - okurz

Actions #1

Updated by nicksinger over 2 years ago

  • Project changed from openQA Project to openQA Infrastructure
Actions #2

Updated by okurz over 2 years ago

  • Subject changed from Organize labs move to new bulding and SRV2 to Organize labs move to new building and SRV2
  • Target version set to Ready
Actions #3

Updated by openqa_review over 2 years ago

  • Due date set to 2021-12-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by okurz over 2 years ago

  • Due date changed from 2021-12-03 to 2022-01-31
Actions #5

Updated by nicksinger about 2 years ago

Quick update here: we trashed several servers in the big QA lab. These machines are tagged with the "decommissioned" tag and can be seen here: https://racktables.suse.de/index.php?page=rack&rack_id=15356 and here: https://racktables.suse.de/index.php?page=row&row_id=934

Actions #6

Updated by okurz about 2 years ago

  • Due date deleted (2022-01-31)
  • Status changed from In Progress to Feedback
  • Priority changed from High to Low

Discussed in the weekly unblock 2022-01-19. #102650#note-5 describes the current state. We agreed that we currently don't do active work but wait for people from facilities and SUSE-IT to ask us, in particular nsinger, to conduct any potential next steps.

Actions #7

Updated by nicksinger about 2 years ago

I got approached with the request to merge the small lab into the big lab. This will be done by me and @mgriessmeier on Monday the 1st of February. The planned move date is the 1st of March, but Frankencampus will most likely not be ready by that date. Therefore a final move date is not set yet.

Actions #8

Updated by nicksinger about 2 years ago

We need to move our big QA lab into SRV2. Our current plan is to collect the required specs (HUs, power sockets, network sockets) before Wed 16th. On Wed we will - together with gschlotter - take a look into SRV2 to evaluate our space and remove unnecessary old QA hardware from Rack 1 (https://racktables.suse.de/index.php?page=rack&rack_id=516). After this is done we plan to execute the final move of all machines between the 22nd and 24th of February.

Actions #9

Updated by nicksinger about 2 years ago

I did the "inventory" today and have the results written down in my (paper) notebook. I now need to consolidate them and figure out what really needs to be moved.

Actions #10

Updated by xlai about 2 years ago

nicksinger wrote:

We need to move our big QA lab into SRV2. Our current plan is to collect the required specs (HUs, power sockets, network sockets) before Wed 16th. On Wed we will - together with gschlotter - take a look into SRV2 to evaluate our space and remove unnecessary old QA hardware from Rack 1 (https://racktables.suse.de/index.php?page=rack&rack_id=516). After this is done we plan to execute the final move of all machines between the 22nd and 24th of February.

  • @nicksinger @mgriessmeier Hello Nick and Matthias, most of the virtualization IPMI test machines are located in server room "NUE-3.2.16". Is this the one that is planned to be moved in your plan above? If yes, given that the 15sp4 public beta milestone comes on 2022-02-24 and the priority of this task is low, can we ask you to postpone the lab move by one week? Recently, after one of our machines was moved from one lab to another, it no longer works stably with regard to the IPMI SOL console and we had to disable the worker. See poo#106056 for more details. So we are really worried about the stability of our machines after they are moved. Is it okay to postpone the move? @maritawerner @jstehlik @waynechen55 FYI.
Actions #11

Updated by xlai about 2 years ago

  • Related to action #106056: [virtualization][tools] Improve retry behaviour and connection error handling in backend::ipmi (was: "Fail to connect openqaipmi5-sp.qa.suse.de on our osd environment") size:M added
Actions #12

Updated by mgriessmeier about 2 years ago

xlai wrote:

nicksinger wrote:

We need to move our big QA lab into SRV2. Our current plan is to collect the required specs (HUs, power sockets, network sockets) before Wed 16th. On Wed we will - together with gschlotter - take a look into SRV2 to evaluate our space and remove unnecessary old QA hardware from Rack 1 (https://racktables.suse.de/index.php?page=rack&rack_id=516). After this is done we plan to execute the final move of all machines between the 22nd and 24th of February.

  • @nicksinger @mgriessmeier Hello Nick and Matthias, most of the virtualization IPMI test machines are located in server room "NUE-3.2.16". Is this the one that is planned to be moved in your plan above? If yes, given that the 15sp4 public beta milestone comes on 2022-02-24 and the priority of this task is low, can we ask you to postpone the lab move by one week? Recently, after one of our machines was moved from one lab to another, it no longer works stably with regard to the IPMI SOL console and we had to disable the worker. See poo#106056 for more details. So we are really worried about the stability of our machines after they are moved. Is it okay to postpone the move? @maritawerner @jstehlik @waynechen55 FYI.

Unfortunately it is not possible to postpone the move, as the physical connection to the lab will be cut off on the 28th of February.
The initial plan was to do this by the end of March, but recent concerns regarding Common Criteria require us to move now...

We will try our best to keep the impact as low as possible.

Actions #13

Updated by nicksinger about 2 years ago

  • Priority changed from Low to High

I also want to use this opportunity to emphasize again that our labs were never considered stable server rooms. This is exactly the reason why we now see problems like https://progress.opensuse.org/issues/106056 - simply connecting a machine from one switch to another raises problems which are beyond our scope to debug and trace down. I have raised this point several times in the past and urged everyone to install production hardware only in SRV1 or SRV2 and not in our labs. Therefore I also think that the move to SRV2 should bring some benefits for all of us. We had quite some time in the past to rectify this situation and none of us is happy with the current time plan, but as Matthi already mentioned we will do our best to keep the impact as low as possible.

I will raise the priority here as well to reflect the current urgency as plans are now more concrete.

Actions #14

Updated by xlai about 2 years ago

@mgriessmeier @nicksinger, thanks for the reply. Now I better understand the urgency of this move, and I see that you are aware of the public beta and are trying your best to reduce the impact. We agree with this approach. Thanks a lot for the efforts!

Actions #15

Updated by okurz about 2 years ago

nicksinger wrote:

I also want to use this opportunity to emphasize again that our labs were never considered stable server rooms.

I fully support this statement. I would like to use this opportunity to remind everyone that, while we are doing our best to support testing on bare metal, virtualized infrastructure will likely always be more stable and scalable. The corresponding hardware for that is situated in the production-grade SRV1 server room, at least for x86_64 and s390x.

Actions #16

Updated by nicksinger about 2 years ago

Also adding the most recent update regarding the move here:

=== qsf-cluster VMs ===
Running on seth & osiris - both have been moved today and all VMs are running smoothly again. Please let us know if you see any issues with them.

=== qamaster VMs ===
The physical machine has been moved; all VMs and services are up and running.

=== qanet.qa.suse.de ===
Core of the 10.162.x.x subnet - providing DNS/DHCP and PXE - was moved and all services are up and running.

=== grenache.qa.suse.de ===
Mainly works as a jump host in openQA for different backends, e.g. s390x - it was moved and all openQA services are up and running.

=== Virtualization testing servers ===
Mainly IPMI hosts for openQA; we have moved gonzo, fozzie, scooter, kermit & quinn. All machines are reachable over their BMCs and should cause no problems.
I also just checked openQA and jobs are working as expected.
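
A minimal sketch of how such a BMC reachability check could look, assuming the usual <host>-sp.qa.suse.de naming for the management interfaces (as with e.g. holmes-sp.qa.suse.de) and placeholder credentials (the real ones are kept in workerconf.sls); the host list is illustrative only:

    #!/usr/bin/env python3
    # Rough sketch only: query the chassis power state of each moved host via
    # ipmitool to confirm its BMC answers after the move. Host names follow the
    # assumed <host>-sp.qa.suse.de scheme; user and password are placeholders.
    import subprocess

    BMC_HOSTS = [
        "gonzo-sp.qa.suse.de",
        "fozzie-sp.qa.suse.de",
        "scooter-sp.qa.suse.de",
        "kermit-sp.qa.suse.de",
        "quinn-sp.qa.suse.de",
    ]

    for host in BMC_HOSTS:
        result = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", host,
             "-U", "ADMIN", "-P", "PASSWORD",  # placeholder credentials
             "chassis", "power", "status"],
            capture_output=True, text=True,
        )
        # prints e.g. "Chassis Power is on" or the ipmitool error message
        print(f"{host}: {result.stdout.strip() or result.stderr.strip()}")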

For tomorrow we plan to move the rest of the machines, mainly consisting of the Kernel test machines and some other leftovers. Given the experience from today, the impact on (openQA) testing is considered low.

Additionally we plan to move the remaining workstations to an office on the 2nd floor. They will be connected to the 10.160.x.x subnet (SUSE R&D) if you don't need them in VLAN 12 (10.162.x.x aka qanet). Please make sure to power them down before 10 am; we will try to contact you in any case before we shut them down physically.

Actions #17

Updated by okurz about 2 years ago

  • Related to action #107437: [alert] Recurring "no data" alerts with only few minutes of outages since SUSE Nbg QA labs move size:M added
Actions #18

Updated by okurz about 2 years ago

  • Related to action #107257: [alert][osd] Apache Response Time alert size:M added
Actions #19

Updated by okurz about 2 years ago

  • Status changed from Feedback to In Progress
Actions #20

Updated by openqa_review about 2 years ago

  • Due date set to 2022-03-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by okurz almost 2 years ago

As long as racktables still shows the QA labs I would say we are not done here :)

Actions #22

Updated by livdywan almost 2 years ago

  • Due date changed from 2022-03-25 to 2022-05-27

okurz wrote:

As long as racktables still shows the QA labs I would say we are not done here :)

Ack. I brought it up in jitsi the other day and we basically concluded that it's going to take more time, and we don't know when this will realistically move ahead.

Actions #23

Updated by mkittler almost 2 years ago

  • Subject changed from Organize labs move to new building and SRV2 to Organize labs move to new building and SRV2 size:M
  • Description updated (diff)
Actions #25

Updated by okurz almost 2 years ago

Walkthrough of the current information in racktables with nsinger:

  1. Racktables entries for the old QA labs still exist. After completing NUE-SRV2-B-3+NUE-SRV2-B-4, nsinger plans to revisit the entries from the old QA labs and move them to virtual scrap racks
  2. The "common name" should include the suffix .qa.suse.de, which is what we have for most machines. We have already added it to some machines in rack 1
  3. https://wiki.racktables.org/index.php/RackTablesUserGuide#SNMP_Sync says that racktables can get information from network switches automatically, which sounds nice and worthwhile to explore further
  4. There are some differing "tags", e.g. the "Usage Type" which can be "Testing"/"Development"/"Production". We suggest avoiding "Development" and using "Production" for machines that we expect to be "mostly up", e.g. qanet, qamaster, etc., or anything that is maintained by SUSE QE Tools. All bare-metal testing machines should then be "Testing"
  5. We wonder whether grenache-4 and grenache-5 even still exist. Also, what about grenache-2 and grenache-3? We tried to access an HMC but neither powerhmc1.arch.suse.de nor powerhmc2.arch.suse.de is reachable. We tried via novalink on grenache. With the command pvmctl lpar list we found grenache-1 through grenache-8. We found for some other machines that only "free slots" are used. -> Based on https://suse.slack.com/archives/C029APBKLGK/p1649246680601499?thread_ts=1649241786.564629&cid=C029APBKLGK I could make that look proper by deleting all objects except for the first within the server chassis, and now we have 3 free slots showing up
  6. I realized that nicksinger has permission to change the types of objects while I don't. Update: Both nicksinger and okurz can change the type of some objects but not of others. Created https://sd.suse.com/servicedesk/customer/portal/1/SD-82695
  7. Having at least the MAC address for each machine is helpful for debugging. We checked holmes.qa.suse.de as the first non-production machine in NUE-SRV2-B-Rack-1. Using the IPMI credentials from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls we could log in to https://sp.holmes.qa.suse.de/ (equivalent to https://holmes-sp.qa.suse.de/ , a CNAME entry in DNS). As a nice surprise we found that the HMC of holmes knows (likely from SNMP) to which switch port and switch MAC address it is connected. We crosschecked that by looking into the configuration SSH interface of the switch
  8. We found that the SNMP lookup in racktables works nicely, e.g. on qanet15nue.qa.suse.de with v1 and community "public". With that we can update the configuration in racktables. Maybe we can let hosts set more SNMP information on the switch, like the port description, and then read that port description into racktables via the SNMP sync function? A sketch of such a lookup follows below.
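
A minimal sketch of how reading the switch port descriptions could look, assuming net-snmp's snmpwalk is installed and using the SNMP v1 / "public" access noted in point 8; qanet15nue.qa.suse.de serves only as the example switch, and how the result would be mapped into racktables is left open:

    #!/usr/bin/env python3
    # Sketch only: walk the interface name (IF-MIB::ifDescr) and port description
    # (IF-MIB::ifAlias) tables of the example switch via SNMP v1 with community
    # "public", so the descriptions could be compared with or fed into racktables.
    import subprocess

    SWITCH = "qanet15nue.qa.suse.de"  # example switch from point 8 above

    for mib_object in ("IF-MIB::ifDescr", "IF-MIB::ifAlias"):
        output = subprocess.run(
            ["snmpwalk", "-v1", "-c", "public", SWITCH, mib_object],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in output.splitlines():
            # typical line: IF-MIB::ifAlias.10113 = STRING: <port description>
            oid, _, value = line.partition(" = ")
            print(f"{oid:40} {value}")
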
Actions #26

Updated by nicksinger almost 2 years ago

I moved all old machines from the small lab into https://racktables.suse.de/index.php?page=row&row_id=15352 and could delete "QE small" successfully. Now the same for the big lab.

Actions #27

Updated by nicksinger almost 2 years ago

  • Status changed from In Progress to Resolved

QA Big is also deleted now. For scrap see comment/link above, for "cold storage" see https://racktables.suse.de/index.php?page=rack&tab=default&rack_id=16586

Actions #28

Updated by okurz almost 2 years ago

  • Copied to action #109746: Improve QA related server room management, consistent naming and tagging size:M added