coordination #121720

open

[saga][epic] QE setup in PRG2+NUE3

Added by okurz 12 months ago. Updated about 20 hours ago.

Status: Blocked
Priority: High
Assignee:
Target version:
Start date: 2018-06-29
Due date: 2023-12-31 (Due in 32 days)
% Done: 81%
Estimated time: (Total: 0.00 h)
Tags:

Description

Motivation

SUSE is deprecating NUE1 (Maxtorhof) and setting up a Prague co-location datacenter, "Prg CoLo" or "DC7", as the primary location, in particular for serving public services. This includes what we have so far served from VM clusters managed by EngInfra, in particular the openqa.opensuse.org infrastructure and likely also openqa.suse.de. Put differently: everything that is currently served from NUE1-SRV1. We must participate in the planning, the setup and the subsequent migration until we can provide our services from Prg CoLo and no longer rely on NUE1-SRV1, except for the purpose of an optional fail-over datacenter in Nbg.
SUSE is deprecating NUE1 (Maxtorhof) and setting up replacement data centers. Additionally a new datacenter is planned as fail-over location

Acceptance criteria

  • AC1: SUSE QE Tools services are provided out of Prg CoLo #123800
  • AC2: NUE1 (Maxtorhof) is not relied upon by SUSE QE Tools anymore and has been evacuated by us #129280
  • AC3: Relevant SUSE QE Tools services are provided out of NUE3 #130955

Further details

Coordination chat room #dct-migration


Subtasks 122 (26 open, 96 closed)

  • openQA Infrastructure - action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M | Rejected | okurz
  • openQA Infrastructure - action #94783: Bring openqaworker11 into production including multi-machine test support (same as w12) size:M | Rejected | 2021-06-28
  • openQA Infrastructure - action #95167: Bring openqaworker12 into production including multi-machine test support | Rejected | okurz | 2021-07-07
  • coordination #116623: [epic] Migration of SUSE openQA+QA+QAM systems to new security zones | Resolved | okurz | 2022-09-14
  • action #116626: Migration of SUSE QA systems to new security zones - QAM systems | Resolved | okurz | 2022-09-15
  • action #116629: Preparation planning for migration of SUSE openQA+QA systems to new security zones size:M | Resolved | okurz | 2022-09-15
  • openQA Infrastructure - action #116689: Do not rely on statically configured IPv4 addresses for the salt master in /etc/hosts size:S | Resolved | okurz | 2022-09-14
  • action #117043: Request DHCP+DNS services for new QE network zones, same as already provided for .qam.suse.de and .qa.suse.cz | Resolved | okurz
  • action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M | Resolved | okurz | 2022-11-17
  • action #119446: Conduct the migration of SUSE openQA+QA systems from Nbg SRV2 to new security zones | Resolved | okurz | 2022-09-15
  • action #119449: Conduct the migration of SUSE openQA+QA systems from Nbg QA labs to new security zones | Resolved | okurz | 2022-09-15
  • action #119638: Ensure every physical machine within .qam.suse.de has an IPMI+eth L2 address entry in racktables size:M | Resolved | okurz
  • openQA Infrastructure - action #120025: [openQA][ipmi][worker] Worker host hostname changed and broken networking connection | Resolved | okurz | 2022-11-07
  • openQA Infrastructure - action #120163: Use salt grains instead of manually specifying IPs in "bridge_ip" size:M | Resolved | mkittler
  • action #120264: Conduct the migration of SUSE QA systems (non-tools-team maintained) from Nbg SRV1 to new security zones size:M | Resolved | okurz | 2022-09-15
  • action #120267: Conduct the migration of openqa-ses aka. "storage.qa.suse.de" size:M | Resolved | mkittler | 2022-09-15
  • openQA Infrastructure - action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:M | Resolved | mkittler
  • openQA Tests - action #120288: [tools] cloud based tests fail due to traffic to cloud blocked auto_review:"2022-11-0.*Test died: (Waiting for Godot.*ssh|Cannot find image after upload)":retry | Resolved | okurz | 2022-11-10
  • openQA Project - action #120333: [os-autoinst][ipmi] Add support for ssh jump host in IPMI backend | Rejected | okurz | 2022-11-11
  • openQA Infrastructure - action #120339: QEMU DNS fails to resolve openqa.suse.de via IP address | Resolved | okurz | 2022-11-11
  • openQA Infrastructure - action #120441: OSD parallel jobs failed with "get_job_autoinst_url: No worker info for job xxx available" size:meow | Resolved | okurz | 2022-11-15
  • openQA Tests - action #120789: [virtualization] tests fail to upload to qadb on dbproxy.suse.de with "Access denied, this account is locked" | Resolved
  • openQA Infrastructure - action #120807: [alert] openqa.suse.de - worker12.oqa.suse.de 100% packet loss due to outdated AAAA record | Resolved | okurz | 2022-11-17
  • openQA Project - coordination #122650: [epic] Fix firewall block and improve error reporting when test fails in curl log upload | Resolved | okurz | 2022-12-29
  • openQA Tests - action #122539: test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M | Closed | 2022-12-29
  • openQA Project - action #122608: exit code of shell command not received by script_run | Resolved | okurz | 2023-01-02
  • openQA Infrastructure - action #122653: Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get more clear results size:S | Rejected | okurz | 2023-01-03
  • openQA Infrastructure - action #122656: Ask SUSE-IT network admins to *not* block this traffic which we need for tests regarding s390x within SUSE network size:M | Resolved | okurz | 2023-01-03
  • openQA Project - action #122659: Improved error reporting in openQA tests when curl times out on connection attempts | Rejected | okurz | 2023-01-03
  • action #123697: Conduct the migration of SUSE QA systems s390x zVM instances to new security zones size:M | Resolved | okurz | 2022-09-15
  • openQA Infrastructure - action #124119: Conduct the migration of remaining SUSE openQA systems IPMI to new security zones | Resolved | okurz | 2023-02-08
  • openQA Infrastructure - action #124715: Failing pipelines because of unreachable machine openqaworker-arm-1 | Rejected | 2023-02-08
  • coordination #124721: [epic] Ensure proper QE maintainership of Nbg QAM machines | Resolved | okurz | 2023-02-17
  • action #124724: Ensure Nbg QAM machines have a current maintainer as "contact person" size:S | Resolved | okurz | 2023-02-17
  • action #125144: Give members of SUSE QE Tools team a chance to get familiar with Nbg QAM machines size:M | Resolved | okurz | 2023-02-17
  • action #125234: Decommission obsolete machines in qam.suse.de size:M | Resolved | okurz | 2023-03-01
  • openQA Infrastructure - action #124877: Failing pipelines because of unreachable machine openqaworker-arm-1 | Resolved | mkittler | 2023-02-08
  • action #107731: Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading size:M | Resolved | okurz | 2022-03-01
  • openQA Infrastructure - action #125534: Consolidate the installation of openqaw5-xen with SUSE QE Tools maintained machines size:M | Resolved | okurz | 2023-03-07
  • openQA Infrastructure - action #125750: In salt-states-openqa support machines requiring ssh password login for root user size:M | Resolved | osukup
  • openQA Infrastructure - action #151390: Brute-force salt osiris so that we enable self-management of VMs for users size:M | Resolved | mkittler | 2023-11-24
  • openQA Infrastructure - action #151396: After osiris is now in salt decide about the fate of seth | Resolved | okurz
  • coordination #121726: [epic] Get management access to o3/osd and other QE related VMs | Blocked | okurz | 2022-12-08
  • action #121729: [timeboxed:10h][research] Find out what libvirt can do to provide access only to a single VM for users/groups | Resolved | okurz | 2022-12-08
  • action #132149: Coordinate with Eng-Infra to get simple management access to VMs (o3/osd/qa-jump.qe.nue2.suse.org) size:M | Blocked | okurz | 2023-06-29
  • coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo | Blocked | okurz | 2021-10-06 | 2023-12-15
  • openQA Infrastructure - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M | Resolved | dheidler | 2023-06-29
  • openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M | Resolved | mkittler | 2023-06-29
  • action #132140: Support move of PowerPC machines to PRG2 size:M | Blocked | okurz | 2023-06-29
  • openQA Infrastructure - action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M | Resolved | nicksinger | 2023-06-29
  • action #132146: Support migration of osd VM to PRG2 - 2023-08-29 size:M | Resolved | mkittler | 2023-06-29
  • action #132158: Ensure that osd can work without relying on any physical machine in NUE1 size:M | Resolved | okurz | 2023-06-29
  • openQA Infrastructure - action #132461: manage tls certificates on o3/ariel directly with dehydrated size:M | Resolved | nicksinger | 2023-07-07
  • openQA Infrastructure - action #132647: Migration of o3 VM to PRG2 - bare-metal tests size:M | Blocked | dheidler
  • openQA Infrastructure - action #133160: Setup a modern UEFI httpboot setup on o3 with dnsmasq size:M | Resolved | dheidler | 2023-07-21
  • openQA Infrastructure - action #133181: Migration of o3 VM to PRG2 - Fix https://openqa.opensuse.org/snapshot-changes/opensuse/Tumbleweed/ | Resolved | okurz
  • openQA Infrastructure - action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working | Blocked | okurz
  • openQA Infrastructure - action #133364: Migration of o3 VM to PRG2 - Decommission old-ariel in NUE1 as soon as we do not need it anymore | Resolved | okurz
  • openQA Infrastructure - action #133475: Migration of o3 VM to PRG2 - connection to rabbit.opensuse.org | New
  • openQA Infrastructure - action #133490: Migration of o3 VM to PRG2 - Fix o3 bare metal hosts iPXE booting size:M | Resolved | dheidler
  • openQA Infrastructure - action #134081: Setup new PRG2 openQA hardware https://racktables.suse.de/index.php?page=object&object_id=23373 | New
  • openQA Infrastructure - action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M | Workable | nicksinger | 2023-11-29
  • openQA Infrastructure - action #134126: Setup new PRG2 openQA worker for o3 - bare-metal testing size:M | Blocked | okurz
  • openQA Infrastructure - action #134822: Migration of osd VM to PRG2 - Decommission old-osd in NUE1 as soon as we do not need it anymore size:M | Resolved | okurz | 2023-08-30
  • openQA Project - action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M | Resolved | livdywan
  • openQA Infrastructure - action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M | Resolved | okurz | 2023-08-31
  • action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database | New
  • openQA Infrastructure - action #134900: salt states fail to apply due to "Pillar openqa.oqa.prg2.suse.org.key does not exist" | Resolved | nicksinger | 2023-08-31
  • openQA Infrastructure - action #134912: Gradually phase out NUE1 based openQA workers size:M | Resolved | okurz
  • openQA Infrastructure - action #135191: Migration of o3 VM to PRG2 - Use direct zabbix connection size:M | Resolved | dheidler
  • openQA Infrastructure - action #137408: Support move of s390x mainframe(s) to PRG2 - o3 size:M | In Progress | mgriessmeier | 2023-06-29 | 2023-12-07
  • coordination #137630: [epic] QE (non-openQA) setup in PRG2 | Blocked | okurz | 2023-09-20
  • action #138356: Migration of qam.suse.de to PRG2 size:M | Resolved | okurz | 2023-10-23
  • action #139130: Migration of openqa-service to PRG2 size:M | Resolved | okurz
  • action #139106: Ensure a PRG2 based QE PowerPC HMC is reachable over proper FQDN and reverse PTR | Blocked | okurz | 2023-06-29
  • action #139109: Support move of non-openQA PowerPC machines to PRG2, i.e. haldir, legolas, whale, blackcurrant, cloudberry, huckleberry, soapberry, nessberry | New | 2023-06-29
  • action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 | New | 2023-06-29
  • action #139115: Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M | Workable | 2023-06-29
  • action #139199: Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 | Blocked | okurz | 2023-06-29
  • openQA Infrastructure - action #150956: o3 cannot send e-mails via smtp relay size:M | Feedback | okurz | 2023-11-16 | 2023-12-15
  • coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters | Blocked | okurz | 2018-06-29 | 2023-12-31
  • coordination #37910: [tools][epic] Migration of or away from qanet.qa.suse.de | Blocked | okurz | 2018-06-29
  • action #38012: [tools][labs][medium] Setup DHCPv6 and DNS AAAA records for VLAN12 | Rejected | okurz | 2018-06-30
  • action #38018: [labs][tools] Setup new qanet | Rejected | okurz | 2018-06-29
  • action #81192: [tools] Migrate (upgrade or replace) qanet.qa.suse.de to a supported, current OS size:M | Blocked | okurz | 2020-12-18
  • action #81200: [tools][labs] some partitions on qanet are 100% full, seems like /data/backups has no new archives since 20201009 due to that | Resolved | okurz | 2020-12-18
  • openQA Infrastructure - action #134051: Eng-Infra maintained DNS server for .qa.suse.de taking over from qanet size:M | Resolved | dheidler | 2023-08-09
  • action #124221: Repurpose quake.qe.nue2.suse.org (formerly known as cloud4) as employee-workstation replacement size:M | Resolved | okurz | 2023-02-09
  • action #129283: [tools] Help Needed: Active Inventory of Maxtorhof SRV1/SRV2/SRV2X | Resolved | okurz | 2023-05-15
  • action #130796: Use free blades on quake.qe.nue2.suse.org and unreal.qe.nue2.suse.org as openQA OSD bare-metal test machines | Resolved | okurz | 2023-02-09
  • openQA Infrastructure - coordination #131519: [epic] Additional redundancy for OSD virtualization testing | Resolved | okurz | 2023-02-09
  • openQA Infrastructure - action #131549: [spike][timeboxed:20h] Additional redundancy for OSD virtualization testing - Hyperv 2016 worker host size:M | Resolved | nanzhang | 2023-06-28
  • openQA Infrastructure - action #133247: Additional redundancy for OSD virtualization testing - Hyperv 2019 and 2022 (or 2012r2) worker host size:M | Resolved | rcai | 2023-07-25
  • openQA Infrastructure - action #133367: Evaluate if we have hardware alternatives for Windows Server 2016+ testing | Resolved | okurz | 2023-07-26
  • openQA Infrastructure - action #137306: Check unreal6 cabling, SP and system not reachable over network size:M | Resolved | okurz | 2023-10-02
  • openQA Infrastructure - action #138350: worker31 and likely more OSD machines get stuck on boot in grub command line | Resolved | dheidler | 2023-06-28
  • action #132617: Move of selected LSG QE machines NUE1 to PRG2e size:M | Feedback | okurz | 2023-12-08
  • action #133706: Setup of former QAM machines from NUE1-SRV2 in FC Basement | Resolved | okurz | 2023-08-02
  • action #133748: Move of openqaworker-arm-1 to FC Basement size:M | Workable
  • openQA Infrastructure - action #134087: Fix ix64ph1075 bare metal openQA test size:M | Resolved | dheidler | 2023-08-10
  • openQA Infrastructure - action #134132: Bare-metal control openQA worker in NUE2 size:M | Resolved | okurz
  • openQA Infrastructure - action #136133: Migrate aarch64.openqanet.opensuse.org to FC Basement size:M | Resolved | dheidler
  • action #137144: Ensure that we have less or no workstation left clogging our FC Basement space size:M | Feedback | okurz | 2023-12-31
  • openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 | New
  • coordination #130955: [epic] Migration out of SUSE NUE1 - QE setup in NUE3 | Resolved | okurz | 2023-06-20
  • action #131144: Decide about all LSG QE machines in NUE1 size:M | Resolved | okurz | 2023-06-20
  • action #132620: Move of selected LSG QE machines NUE1 to NUE3 size:M | Resolved | okurz | 2023-07-12
  • action #132623: Decommissioning of selected selected LSQ QE machines from NUE1-SRV2 | Resolved | okurz | 2023-07-12
  • openQA Infrastructure - action #132671: Ensure everybody in SUSE QE Tools knows how to access netbox size:M | Resolved | livdywan | 2023-07-07
  • coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines | Resolved | okurz | 2023-06-28
  • action #131528: Bring backup.qam.suse.de up-to-date size:M | Resolved | okurz | 2023-06-28
  • action #132320: Bring styx.qam.suse.de up-to-date | Resolved | okurz | 2023-06-28
  • action #132323: Bring arm4.qe.suse.de up-to-date | Resolved | okurz | 2023-06-28
  • action #132347: Bring borg.qam.suse.de up-to-date | Resolved | okurz | 2023-06-28
  • action #132353: Bring enterprise-nx02.qam.suse.de up-to-date size:M | Resolved | okurz | 2023-06-28
  • action #132356: Bring fibonacci.qam.suse.de up-to-date | Resolved | okurz | 2023-06-28
  • action #132359: Bring galileo.qam.suse.de up-to-date size:M | Resolved | okurz | 2023-06-28
  • action #132362: Bring openqa-service.qe.suse.de up-to-date | Resolved | okurz | 2023-06-28
  • action #132452: Bring seth+osiris up-to-date | Resolved | okurz | 2023-06-28
  • openQA Infrastructure - action #134453: backup.qam.suse.de is Failed according to netbox and not creating backups size:M | Resolved | mkittler
  • openQA Infrastructure - action #134489: backup.qa.suse.de does not create backups | Resolved | tinita | 2023-08-22
  • openQA Infrastructure - action #134519: We were not notified that backup.qa.suse.de did not create backups size:M | Resolved | livdywan | 2023-08-23

Actions #2

Updated by okurz 12 months ago

  • Priority changed from Normal to Low
  • Target version changed from Ready to future

Had a meeting with EngInfra TL mflores on 2022-12-07. Prg CoLo will start migrating services in 2023-03: bugzilla, gitlab, virtualization clusters. s390 and PowerPC will be moved as well, likely in 2023-05. They should be offline for some days and then usable again after setup in Prg CoLo. x86_64+aarch64 is ordered as new. The new Nbg DC will also be set up in that time. 40 racks for everything from NUE1 that does not fit in or move to FC labs. Monitoring: Prg CoLo will have switches and firewalls. They shall be configured as IaC, maybe with salt or terraform. After that, monitoring is planned, but I consider it doubtful that this will work out.

2022-12-08: Decided with mgriessmeier, nsinger, mflores to order 4x ARM machines for Prg CoLo to have redundancy for each o3+osd, i.e. 2xARM@o3, 2xARM@osd

Right now we are waiting for the DC to be ready for us to use, or for any pending questions.

Actions #3

Updated by okurz 10 months ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Target version changed from future to Ready

-> subtasks

Actions #4

Updated by okurz 10 months ago

  • Target version changed from Ready to future

I would like to track this outside our current backlog as we don't need to conduct that much work now.

Actions #5

Updated by okurz 9 months ago

  • Tags set to infra
Actions #6

Updated by okurz 6 months ago

Actions #7

Updated by okurz 6 months ago

  • Target version changed from future to Ready
Actions #8

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #9

Updated by okurz 5 months ago

  • Subject changed from [saga][epic] QE setup in Prg CoLo to [saga][epic] QE setup in PRG2 aka. Prg CoLo
Actions #10

Updated by okurz 5 months ago

  • Subject changed from [saga][epic] QE setup in PRG2 aka. Prg CoLo to [saga][epic] QE setup in PRG2+NUE3
  • Description updated (diff)

Combining #121720 and #130955 as there is too much overlap

I wrote a message in https://mailman.suse.de/mailman/private/qa-team/2023-June/005988.html

Hi all,
Be advised that there are plans to fully empty the old Nuremberg
datacenter at the old office location "Maxtorhof" aka. NUE1 until end of
this year. This means moving services and machines to other locations or
decommissioning services or machines that are not needed anymore.
The SUSE QE Tools team will organize, execute and lead any necessary
actions concerning LSG QE services and machines as far as we know of.

How will you be impacted by this? In the best case you will only see
short outages of services during the actual migrations. Maybe you will
need to reach specific machines by new domains (FQDNs). Likely over the
next weeks and months individual services will have outages and
performance degradations. In the worst case critical machines that no
one considered will be lost and services need to be recovered with
careful and tricky reverse engineering. Good planning and reviews of
plans can mitigate that risk :)

According to current plans we want to setup new openQA workers in the
following weeks and the service openqa.suse.de and according virtual
machine will move to PRG2 (the new Prague datacenter) on 2023-07-17.
Expect an outage on https://openqa.nue.suse.com and
https://openqa.suse.de on that day.

The equivalent migration will be conducted for
https://openqa.opensuse.org at beginning of 2023-08.

Find more details in
https://progress.opensuse.org/issues/121720

Have fun,
Oliver

and an according copy in https://suse.slack.com/archives/C02CANHLANP/p1687787065732719

Actions #12

Updated by okurz 5 months ago

  • Project changed from 46 to QA
  • Category deleted (Infrastructure)
Actions #13

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #14

Updated by okurz 3 months ago

  • Subtask #129280 added
Actions #15

Updated by okurz 3 months ago

  • Subtask #116623 added
Actions #16

Updated by okurz about 1 month ago

  • Subtask #138476 added
Actions #17

Updated by okurz about 20 hours ago

  • Subtask #118636 added