action #107731

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading size:M

Added by okurz over 2 years ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2022-03-01
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See #107173#note-6 . I also chatted with nsinger about this: "s.qa is now salted, should we add it as salt node to OSD for auto-update+monitoring? What other machines should we add? Where does it end? If we include backup.qa, s.qa, storage.qa, why not also include qamaster, qanet? Or all QA machines? Nick Singer: Indeed I plan to eventually add all QA machines into a single salt for at least stuff like passwords and ssh keys"

Acceptance criteria

  • AC1: All common production QA machines are controlled by salt (not workstations or bare-metal test machines)

Suggestions

  • Review all common production QA machines and VMs in racktables and ensure they are controlled at least by some remote-management framework repository, ideally salt-states-openqa, which also ensures automatic updates and monitoring
  • If you don't know whether a machine is production or not, ask okurz
  • For machines that involve another repository, ensure that they are still included in automatic updates and monitoring
  • For any machines that are not straightforward, make sure that a specific open ticket exists covering that machine
  • Use Racktables to find out what the common production QA machines are
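To cross-check coverage, the accepted-minion list from `sudo salt-key -L` on the salt master could be diffed against the production machine list from racktables. A minimal sketch, assuming both lists are available as plain hostnames (the hosts below are example data from this ticket, not the real lists):

```shell
# Hypothetical sketch: compare the accepted-minion list (as printed by
# `sudo salt-key -L` on the salt master) with the production machine list
# from racktables and report machines that are not yet salted.
accepted="backup.qa.suse.de
openqaw5-xen.qa.suse.de
qamaster.qa.suse.de"

production="galileo.qam.suse.de
openqaw5-xen.qa.suse.de
qamaster.qa.suse.de
styx.qam.suse.de"

echo "$accepted" > /tmp/accepted-minions.txt
# Production hosts whose names do not appear among the accepted minion keys:
missing=$(echo "$production" | grep -Fxv -f /tmp/accepted-minions.txt)
echo "$missing"
```

With the example data this prints galileo.qam.suse.de and styx.qam.suse.de, i.e. the production hosts still missing from salt.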

Out of scope

  • openqa.opensuse.org infrastructure completely

Related issues (12: 1 open, 11 closed)

  • Related to QA - action #107173: s.qa.suse.de needs to be upgraded to a current OS (Resolved, okurz, 2022-02-21)
  • Related to QA - coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132323: Bring arm4.qe.suse.de up-to-date (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132320: Bring styx.qam.suse.de up-to-date (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132362: Bring openqa-service.qe.suse.de up-to-date (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132359: Bring galileo.qam.suse.de up-to-date size:M (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132356: Bring fibonacci.qam.suse.de up-to-date (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132353: Bring enterprise-nx02.qam.suse.de up-to-date size:M (Resolved, okurz, 2023-06-28)
  • Related to QA - action #132347: Bring borg.qam.suse.de up-to-date (Resolved, okurz, 2023-06-28)
  • Related to openQA Infrastructure - action #116716: Repurpose ix64ph1079, ix64ph1080, ix64ph1081, e.g. as openQA workers (New)
  • Related to QA - action #130796: Use free blades on quake.qe.nue2.suse.org and unreal.qe.nue2.suse.org as openQA OSD bare-metal test machines (Resolved, okurz, 2023-02-09)
  • Related to openQA Infrastructure - action #151390: Brute-force salt osiris so that we enable self-management of VMs for users size:M (Resolved, mkittler, 2023-11-24)
Actions #1

Updated by okurz over 2 years ago

  • Related to action #107173: s.qa.suse.de needs to be upgraded to a current OS added
Actions #2

Updated by okurz over 2 years ago

Discussed with mgriessmeier. We see that currently existing QE infrastructure can benefit from structured infrastructure management, e.g. automatic upgrades, reboots, ssh key handling with salt. We prefer to have a separate git repository for the QE infrastructure.

Actions #3

Updated by okurz over 1 year ago

  • Tags set to infra
  • Parent task set to #118636
Actions #4

Updated by okurz about 1 year ago

  • Target version changed from future to Ready
Actions #5

Updated by mkittler about 1 year ago

  • Subject changed from Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading to Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz about 1 year ago

  • Priority changed from Normal to High
Actions #7

Updated by okurz 12 months ago

  • Priority changed from High to Urgent

This becomes an important prerequisite for efficient handling of #121720 as well, so that we can have access to machines and thereby find out whether there are any unused resources, to determine the best datacenter migration target location.

Actions #8

Updated by okurz 12 months ago

  • Description updated (diff)
Actions #9

Updated by okurz 12 months ago

Actions #10

Updated by livdywan 12 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look now. Going to go through the list of what's in Production and check the state of each machine: what's salted and what isn't.

Actions #11

Updated by livdywan 12 months ago

Double-checked all machines via git grep in salt-pillars, ssh, and sudo salt-key -L respectively:

  • ada: Blocked by #115562
  • arm4.qe.suse.de is not reachable via SSH
    • check with UV squad what the state of it is
  • backup.qam.suse.de #131528
  • borg.qam.suse.de ssh: connect to host borg.qam.suse.de port 22: Connection refused
    • file ticket
  • conan.qam.suse.de: Blocked by #115562
  • enterprise-nx02.qam.suse.de
    • file ticket to add the machine to salt
  • fibonacci.qam.suse.de
    • file ticket to add the machine to salt
    • "Shutdown on 20.7.22" comment in racktables!?
  • galileo.qam.suse.de
    • machine needs to be salted
    • running SLES 11 SP4 so probably needs to be updated first
  • grenache.qa.suse.de
    • salted
    • grenache is the same machine / the chassis
  • ix64ph1075.qa.suse.de
    • salted
  • Linux ONE III
    • rack, can't salt this
  • openqa-service.qe.suse.de
    • needs salt
  • openqaw5-xen.qa.suse.de
    • salted, all good
  • powerqaworker-qam-1.qa.suse.de
    • salted, all good
  • QA-Power8-4.qa.suse.de
    • salted, all good
  • qamaster.qa.suse.de
    • salted, all good
  • qanet.qa.suse.de
    • old machine that we still need in NUE1, see #81192 . Expected to be obsoleted with #130955
  • styx.qam.suse.de
    • not currently salted
    • formerly maintained by dabatianni+apappas
    • check with virt squad and UV squad. We can log in using the old QAM root password. It's an ESXi server, not saltable
    • remove Production tag?
    • copy #131528 and make a new ticket
  • walter1.qe.nue2.suse.org
    • worker, salted
  • walter2.qe.nue2.suse.org
    • DHCP, maintained by eng infra
  • whale.qam.suse.de
    • see styx
  • worker10/11
    • salted
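The per-host checks above can be sketched as a small helper; a minimal sketch, assuming it runs where ssh is available and where `salt-key` is installed for the salted check (i.e. on the salt master) - `check_host` and the example host are hypothetical:

```shell
# Hypothetical helper mirroring the per-machine checks from this comment:
# probe SSH reachability, and where salt-key is available check whether the
# host's minion key has been accepted by the salt master.
check_host() {
    host=$1
    # Non-interactive SSH probe with a short timeout
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true 2>/dev/null; then
        reachable=yes
    else
        reachable=no
    fi
    # Only meaningful on the salt master; elsewhere salt-key is absent
    if command -v salt-key >/dev/null 2>&1 && sudo salt-key -L | grep -qx "$host"; then
        salted=yes
    else
        salted=no
    fi
    echo "$host reachable=$reachable salted=$salted"
}

# Example with an unresolvable name (both checks fail quickly):
check_host host.invalid
```

Looping this over the production host list from racktables would reproduce the table above in one pass.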

General notes:

  • Can we merge QA/QAM tags or replace both with something sensible?
  • Racks should be visible as "production but not saltable": grenache, Linux ONE III
Actions #12

Updated by okurz 12 months ago

As discussed my suggestions:

  1. racktables and netbox are out of sync. Wait for #132293 before trying to come up with a fancy solution regarding tags. However I think we can still add tags in racktables if we come up with a reasonable suggestion
  2. For any specific machine needing clarification clone #131528 with the same parent for the specific machine
  3. Go over the list of all machines in racktables without the production tag as well to find out if there is any QE/QA machine that should have the production tag
Actions #13

Updated by openqa_review 12 months ago

  • Due date set to 2023-07-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by livdywan 12 months ago

Actions #15

Updated by livdywan 12 months ago

Actions #16

Updated by livdywan 12 months ago

  • Related to action #132362: Bring openqa-service.qe.suse.de up-to-date added
Actions #17

Updated by livdywan 12 months ago

  • Related to action #132359: Bring galileo.qam.suse.de up-to-date size:M added
Actions #18

Updated by livdywan 12 months ago

  • Related to action #132356: Bring fibonacci.qam.suse.de up-to-date added
Actions #19

Updated by livdywan 12 months ago

  • Related to action #132353: Bring enterprise-nx02.qam.suse.de up-to-date size:M added
Actions #20

Updated by livdywan 12 months ago

Actions #21

Updated by livdywan 12 months ago

  • Related to action #116716: Repurpose ix64ph1079, ix64ph1080, ix64ph1081, e.g. as openQA workers added
Actions #22

Updated by livdywan 12 months ago

okurz wrote:

  1. Go over the list of all machines in racktables without the production tag as well to find out if there is any QE/QA machine that should have the production tag

Looking through machines w/o the Production tag:

  • andromeda.openqanet.opensuse.org has no Production tag, presumably owned by QAC, probably Production but not tagged as such - maybe we can assume "Team" means not ours, and it's o3 so we shouldn't care in this context. Just mentioning it here since I wasn't sure at first.
  • blackcurrant has the "Team" tag and is ours
  • davinci.qam.suse.de
  • frisch.qam.suse.de -> TODO ticket
    • dabatianni+apappas
    • not responding to ping nor SSH
    • should it be in production?
  • haldir.qa.suse.de
  • ix64ph1080
  • kadmeia.qe.nue2.suse.org
    • 2023-04-19: Partially connected, can be repurposed
  • kynane.qe.nue2.suse.org
    • 2023-04-19: Partially connected, can be repurposed
  • loge.qam.suse.de -> qam ref host is not "Production"
    • QA-M KERNEL UPDATE REFERENCE HOST
    • Shutdown on 20.7.22
    • not responsive to SSH
    • is this machine still there?
  • mime.qam.suse.de -> qam ref host is not "Production"
    • Kernel ref host
    • Shutdown on 20.7.22
    • not responsive to SSH
    • is this machine still there?
  • nofx.arch.suse.de -> everything in arch can be considered "Testing", not "Production"
    • no owner or purpose specified
    • seems to be online but couldn't successfully connect
  • quake.qe.nue2.suse.org -> not to be salted as intended for "workstation replacements" hence not "Production"
  • unreal.qe.nue2.suse.org
    • #131552
  • sol.qam.suse.de -> to be decommissioned, not "Production". updated https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9314
    • no owner or purpose specified
  • serial.qam.suse.de -> "Move to Frankencampus" for decommissioning
    • no owner or purpose specified
  • thunderx21.qe.nue2.suse.org -> Marked as "Development" in https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9570, not "Production"
    • no owner or purpose specified
    • seems to be offline -> to be handled in #132383
  • {seth,osiris}.qe.nue2.suse.org -> to be salted, see #132452
    • no owner or purpose specified -> I added a description and wiki link
Actions #23

Updated by livdywan 12 months ago

  • Related to action #130796: Use free blades on quake.qe.nue2.suse.org and unreal.qe.nue2.suse.org as openQA OSD bare-metal test machines added
Actions #25

Updated by livdywan 12 months ago

  • Due date deleted (2023-07-19)
  • Status changed from In Progress to Blocked

Just to be clear since this question has come up a second time: there are several existing as well as new tickets about salting and upgrading the relevant machines - especially the upgrading, in my view, likely exceeds our mean cycle time and raises new questions. We did discuss this in previous conversations. Hence I am considering the ticket Blocked on all related tickets - we can't use subtasks because, as also mentioned in earlier comments, they have a different parent ticket.

If others feel confident to salt and upgrade all machines within a couple of days that's fine by me, but then I would not personally attempt it.

Actions #26

Updated by livdywan 11 months ago

Next up: #132359 #131552

Actions #27

Updated by okurz 11 months ago

  • Priority changed from Urgent to Normal

The urgency of #107731-7 was resolved by analyzing all machines. Right now we don't have any more capacity to accommodate all the necessary blockers in the backlog; I can only squeeze in a single one right now: #132359

Actions #28

Updated by okurz 11 months ago

  • Assignee changed from livdywan to okurz
  • Target version changed from Ready to future

No capacity to accommodate any of the blocking tasks in the current backlog, tracking outside that scope.

Actions #29

Updated by okurz 7 months ago

#132359 and all related machines actually in use have been put into production or marked accordingly as unused in racktables. This leaves only #151390, blocking on that

Actions #30

Updated by okurz 7 months ago

  • Related to action #151390: Brute-force salt osiris so that we enable self-management of VMs for users size:M added
Actions #31

Updated by okurz 7 months ago

  • Status changed from Blocked to Resolved
  • Target version changed from future to Ready

#151390 is resolved and now both osiris and seth are covered in salt, so we can resolve here.
