action #107731
closed coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading size:M
Description
Motivation
See #107173#note-6 . I also chatted with nsinger about this: "s.qa is now salted, should we add it as salt node to OSD for auto-update+monitoring? What other machines should we add? Where does it end? If we include backup.qa, s.qa, storage.qa, why not also include qamaster, qanet? Or all QA machines? Nick Singer: Indeed I plan to eventually add all QA machines into a single salt for at least stuff like passwords and ssh keys"
Acceptance criteria
- AC1: All common production QA machines are controlled by salt (not workstations or bare-metal test machines)
Suggestions
- Review all common production QA machines in racktables as well as VMs and ensure they are controlled at least by some remote management framework repository, at best salt-states-openqa, which also ensures automatic updates and monitoring (see the sketch after this list)
- If you don't know if a machine is production or not ask okurz
- For machines that involve another repository ensure that they are still included in automatic updates and monitoring
- For any machines that are not straight-forward make sure that a specific open ticket exists covering that machine
- Use Racktables to find out what common production QA machines are
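As a rough illustration of what bringing one machine under salt could involve, here is a minimal sketch. The host name is only an example, and the master address openqa.suse.de and file paths are assumptions, not the documented setup:

```
# Is a salt minion already running on the machine? If not, install one
# and point it at the (assumed) salt master:
ssh root@openqa-service.qe.suse.de systemctl is-active salt-minion \
  || ssh root@openqa-service.qe.suse.de \
       'zypper -n in salt-minion &&
        echo "master: openqa.suse.de" > /etc/salt/minion.d/master.conf &&
        systemctl enable --now salt-minion'
```

The new minion's key would then still need to be accepted on the master, e.g. with sudo salt-key -a openqa-service.qe.suse.de.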
Out of scope
- The openqa.opensuse.org infrastructure in its entirety
Updated by okurz almost 3 years ago
- Related to action #107173: s.qa.suse.de needs to be upgraded to a current OS added
Updated by okurz almost 3 years ago
Discussed with mgriessmeier. We see that currently existing QE infrastructure can benefit from structured infrastructure management, e.g. automatic upgrades, reboots, ssh key handling with salt. We prefer to have a separate git repository for the QE infrastructure.
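To make "automatic upgrades, reboots, ssh key handling with salt" concrete, a hedged master-side sketch; the minion target pattern is an assumption, not the actual configuration:

```
sudo salt-key -L                        # which machines are under salt at all?
sudo salt '*.qa.suse.de' test.ping      # do the QA minions respond?
sudo salt '*.qa.suse.de' state.apply    # enforce the highstate: users, ssh keys, ...
sudo salt '*.qa.suse.de' pkg.upgrade    # roll out package upgrades
sudo salt '*.qa.suse.de' system.reboot  # reboot after upgrades where needed
```

In a salted setup the last three steps would typically run unattended via scheduled states rather than as ad-hoc commands.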
Updated by okurz almost 2 years ago
- Tags set to infra
- Parent task set to #118636
Updated by mkittler over 1 year ago
- Subject changed from Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading to Salt all SUSE QA machines, at least passwords and ssh keys and automatic upgrading size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Priority changed from High to Urgent
This becomes an important prerequisite for the efficient handling of #121720 as well: with access to the machines we can find out if there are any unused resources and determine the best datacenter migration target location.
Updated by okurz over 1 year ago
- Related to coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines added
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
I'm taking a look now: going through the list of what's in Production and checking the state of each machine, i.e. what's salted and what isn't.
Updated by livdywan over 1 year ago
Double-checked all machines by git-grepping the salt pillars, trying ssh and running sudo salt-key -L respectively (these checks are sketched after the list):
- ada: Blocked by #115562
- arm4.qe.suse.de is not reachable via SSH
- check with the UV squad what its state is
- backup.qam.suse.de #131528
- borg.qam.suse.de ssh: connect to host borg.qam.suse.de port 22: Connection refused
- file ticket
- conan.qam.suse.de: Blocked by #115562
- enterprise-nx02.qam.suse.de
- file ticket to add the machine to salt
- fibonacci.qam.suse.de
- file ticket to add the machine to salt
- "Shutdown on 20.7.22" comment in racktables!?
- galileo.qam.suse.de
- machine needs to be salted
- running SLES 11 SP4 so probably needs to be updated first
- grenache.qa.suse.de
- salted
- grenache is the same machine, i.e. the chassis
- ix64ph1075.qa.suse.de
- salted
- Linux ONE III
- rack, can't salt this
- openqa-service.qe.suse.de
- needs salt
- openqaw5-xen.qa.suse.de
- salted, all good
- powerqaworker-qam-1.qa.suse.de
- salted, all good
- QA-Power8-4.qa.suse.de
- salted, all good
- qamaster.qa.suse.de
- salted, all good
- qanet.qa.suse.de
- styx.qam.suse.de
- not currently salted
- formerly maintained by dabatianni+apappas
- check with the virt squad and UV squad. We can log in using the old QAM root password. It's an ESXi server, not saltable
- remove Production tag?
- copy #131528 and make a new ticket
- walter1.qe.nue2.suse.org
- worker, salted
- walter2.qe.nue2.suse.org
- DHCP, maintained by eng infra
- whale.qam.suse.de
- see styx
- worker10/11
- salted
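The three checks mentioned above amount to something like the following per machine; the repository path and host name are placeholders:

```
# 1. Is the host referenced in the salt pillars?
git -C salt-pillars-openqa grep -l 'backup.qam.suse.de'
# 2. Is it reachable over SSH at all?
ssh -o ConnectTimeout=5 root@backup.qam.suse.de true
# 3. Does the salt master know (and has it accepted) the minion's key?
sudo salt-key -L | grep backup.qam.suse.de
```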
General notes:
- Can we merge QA/QAM tags or replace both with something sensible?
- Racks such as grenache and the Linux ONE III should be visible as "production but not saltable"
Updated by okurz over 1 year ago
As discussed my suggestions:
- racktables and netbox are out of sync. Wait for #132293 before trying to come up with a fancy solution regarding tags. However, I think we can still add tags in racktables if we come up with a reasonable suggestion
- For any specific machine needing clarification clone #131528 with the same parent for the specific machine
- Go over the list of all machines in racktables without the production tag as well to find out if there is any QE/QA machine that should have the production tag
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
- Related to action #132323: Bring arm4.qe.suse.de up-to-date added
Updated by livdywan over 1 year ago
- Related to action #132320: Bring styx.qam.suse.de up-to-date added
Updated by livdywan over 1 year ago
- Related to action #132362: Bring openqa-service.qe.suse.de up-to-date added
Updated by livdywan over 1 year ago
- Related to action #132359: Bring galileo.qam.suse.de up-to-date size:M added
Updated by livdywan over 1 year ago
- Related to action #132356: Bring fibonacci.qam.suse.de up-to-date added
Updated by livdywan over 1 year ago
- Related to action #132353: Bring enterprise-nx02.qam.suse.de up-to-date size:M added
Updated by livdywan over 1 year ago
- Related to action #132347: Bring borg.qam.suse.de up-to-date added
Updated by livdywan over 1 year ago
- Related to action #116716: Repurpose ix64ph1079, ix64ph1080, ix64ph1081, e.g. as openQA workers added
Updated by livdywan over 1 year ago
okurz wrote:
- Go over the list of all machines in racktables without the production tag as well to find out if there is any QE/QA machine that should have the production tag
Looking through machines w/o the Production tag:
- andromeda.openqanet.opensuse.org has no Production tag; presumably owned by QAC and probably Production but not tagged as such. Maybe we can assume "Team" means not ours, and it's o3, so we shouldn't care in this context. Just mentioning it here since I wasn't sure at first.
- blackcurrant has the "Team" tag and is ours
- should it be in production? -> No, it has the tag "Testing" and it's a PowerPC machine actually used for testing, so it's fine. I added a comment in https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=11350
- not responding to SSH -> yes, that's expected because only the LPARs would be reachable directly, e.g. "blackcurrant-1" or however they would be called
- davinci.qam.suse.de
- dabatianni+apappas
- should it be in production? -> yes, because the machine is used as DHCP/DNS server in the qam.suse.de domain. I added a comment in https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=10308 and links to both https://confluence.suse.com/display/maintenanceqa/DNS+Server+NUE and https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/id/davinci_qam_suse_de.sls
- "Managed by SaltStack"
- frisch.qam.suse.de -> TODO ticket
- dabatianni+apappas
- not responding to ping nor SSH (see the reachability sketch after this list)
- should it be in production?
- haldir.qa.suse.de
- 2023-05-23: According to the discussion in https://suse.slack.com/archives/C02CANHLANP/p1684777943823339 the machine is currently not used
- The machine is in the process of being moved to PRG2. Coordinated in #132140
- ix64ph1080
- kadmeia.qe.nue2.suse.org
- 2023-04-19: Partially connected, can be repurposed
- kynane.qe.nue2.suse.org
- 2023-04-19: Partially connected, can be repurposed
- loge.qam.suse.de -> qam ref host is not "Production"
- QA-M KERNEL UPDATE REFERENCE HOST
- Shutdown on 20.7.22
- not responsive to SSH
- is this machine still there?
- mime.qam.suse.de -> qam ref host is not "Production"
- Kernel ref host
- Shutdown on 20.7.22
- not responsive to SSH
- is this machine still there?
- nofx.arch.suse.de -> everything in arch can be considered "Testing", not "Production"
- no owner or purpose specified
- seems to be online but couldn't successfully connect
- quake.qe.nue2.suse.org -> not to be salted as intended for "workstation replacements" hence not "Production"
- unreal.qe.nue2.suse.org
- #131552
- sol.qam.suse.de -> to be decommissioned, not "Production". updated https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9314
- no owner or purpose specified
- serial.qam.suse.de -> "Move to Frankencampus" for decommissioning
- no owner or purpose specified
- thunderx21.qe.nue2.suse.org -> Marked as "Development" in https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9570, not "Production"
- no owner or purpose specified
- seems to be offline -> to be handled in #132383
- {seth,osiris}.qe.nue2.suse.org -> to be salted, see #132452
- no owner or purpose specified -> I added a description and wiki link
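For reference, the reachability triage behind notes like "not responding to ping nor SSH" above can be done as follows; the host name is just one example from the list:

```
ping -c1 -W2 frisch.qam.suse.de || echo "no ping"
ssh -o ConnectTimeout=5 root@frisch.qam.suse.de true || echo "no ssh"
```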
Updated by livdywan over 1 year ago
- Related to action #130796: Use free blades on quake.qe.nue2.suse.org and unreal.qe.nue2.suse.org as openQA OSD bare-metal test machines added
Updated by livdywan over 1 year ago
- Due date deleted (2023-07-19)
- Status changed from In Progress to Blocked
Just to be clear, since this question has come up a second time: there are several existing as well as new tickets about salting and upgrading the relevant machines, and especially the upgrading, in my view, likely exceeds our mean cycle time and raises new questions. We did discuss this in previous conversations. Hence I am considering the ticket Blocked on all related tickets; we can't use subtasks because, as also mentioned in earlier comments, they have a different parent ticket.
If others feel confident they can salt and upgrade all machines within a couple of days, that's fine by me, but I would not personally attempt it.
Updated by okurz over 1 year ago
- Priority changed from Urgent to Normal
Updated by okurz over 1 year ago
- Assignee changed from livdywan to okurz
- Target version changed from Ready to future
No capacity to accommodate any of the blocking tasks in the current backlog; tracking outside that scope.
Updated by okurz about 1 year ago
- Related to action #151390: Brute-force salt osiris so that we enable self-management of VMs for users size:M added
Updated by okurz about 1 year ago
- Status changed from Blocked to Resolved
- Target version changed from future to Ready
#151390 is resolved and now both osiris+seth are covered in salt, so we can resolve here.