action #134912

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Gradually phase out NUE1 based openQA workers size:M

Added by okurz 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -

Description

Motivation

PRG2 workers, along with the NUE2 and PRG1 based openQA workers, should take over all the work for OSD. To ensure we have all worker classes covered we should gradually phase out workers, e.g. disable worker instances or power off machines one by one, and closely check scheduled tests and https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1 to ensure that all worker classes still have some available resources to work on.
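
As a minimal sketch of one such phase-out step (hostname and instance count are placeholders, not taken from this ticket):

# disable and stop all openQA worker instances on one host, then power it off;
# re-check the job-age dashboard above before continuing with the next machine
host=workerN.oqa.suse.de
ssh "$host" 'sudo systemctl disable --now openqa-worker-auto-restart@{1..10}'
ssh "$host" 'sudo poweroff'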

Acceptance criteria

  • AC1: All NUE1 based openQA workers scheduled to move to NUE3 are offline and ready for physical migration

Suggestions


Related issues 9 (1 open, 8 closed)

  • Related to openQA Infrastructure - action #134948: Ensure IPv6 is working in the OSD setup (since we have workers in PRG2 and the VM has been migrated) size:M (Resolved, okurz, 2023-08-31)
  • Related to openQA Infrastructure - action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M (Resolved, okurz, 2023-08-31)
  • Related to openQA Infrastructure - action #134132: Bare-metal control openQA worker in NUE2 size:M (Resolved, okurz)
  • Related to openQA Infrastructure - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)
  • Related to openQA auto review - openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 (New)
  • Related to openQA Infrastructure - action #137270: [FIRING:1] host_up (malbec: host up alert openQA malbec host_up_alert_malbec worker) and similar about malbec (Resolved, okurz, 2023-10-01)
  • Related to openQA Infrastructure - coordination #131519: [epic] Additional redundancy for OSD virtualization testing (Resolved, okurz, 2023-02-09)
  • Related to openQA Tests - action #137384: [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M (Resolved, okurz, 2023-10-04)
  • Copied from openQA Infrastructure - action #132137: Setup new PRG2 openQA worker for osd size:M (Resolved, mkittler, 2023-06-29)

Actions #1

Updated by okurz 8 months ago

  • Copied from action #132137: Setup new PRG2 openQA worker for osd size:M added
Actions #2

Updated by livdywan 8 months ago

  • Subject changed from Gradually phase out NUE1 based openQA workers to Gradually phase out NUE1 based openQA workers size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 8 months ago

  • Priority changed from High to Urgent
Actions #4

Updated by okurz 8 months ago

I already disabled and switched off worker13 and worker9 for good now, see #135122-18

Actions #5

Updated by okurz 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #7

Updated by okurz 8 months ago

  • Related to action #134948: Ensure IPv6 is working in the OSD setup (since we have workers in PRG2 and the VM has been migrated) size:M added
Actions #8

Updated by okurz 8 months ago

  • Related to action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M added
Actions #9

Updated by okurz 8 months ago

  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to High

I added more commits to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/599. However, as there are some special worker classes, I suggest we only merge after IPv6 and reverse DNS PTR resolution are working, so blocking on #134948 and #134879.

Actions #10

Updated by okurz 8 months ago

  • Related to action #134132: Bare-metal control openQA worker in NUE2 size:M added
Actions #12

Updated by okurz 7 months ago

  • Status changed from Blocked to In Progress

Continuing here due to #123800-14

Actions #13

Updated by okurz 7 months ago

#134948 is resolved; IPv6 is expected to be fully up and usable, at least from the OSD VM itself, though not necessarily for all workers.

Actions #14

Updated by openqa_review 7 months ago

  • Due date set to 2023-10-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz 7 months ago

I am going over the list of workers as defined in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls from top to bottom and considering which we can disable and turn off.

worker2 is still needed for NUE1 based s390x, see #132152, continuing with worker3.

ssh osd "sudo salt 'worker3.oqa.suse.de' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs) && poweroff' && sudo salt-key -y -d worker3.oqa.suse.de"

worker5 is also used for s390x, so continuing with worker8, which I disabled and powered off.

worker9 was online but not controlled by salt (@Liv Dywan), now "properly" disabled and powered off as well.

Actions #17

Updated by okurz 7 months ago

I reviewed o3 NUE1 workers. For now I shut down w4 with the worker class config

WORKER_CLASS = openqaworker4,x86_64-v3,qemu_x86_64,pool_is_hdd,heavyload,nue,nue1,nue1-srv1
…
[7]
WORKER_CLASS = openqaworker4,qemu_x86_64,qemu_i586,pool_is_hdd,heavyload

[8]
WORKER_CLASS = openqaworker4,qemu_x86_64,qemu_i586,pool_is_hdd,heavyload

and

w6 with

WORKER_CLASS = openqaworker6,x86_64-v3,qemu_x86_64,qemu_x86_64_staging,platform_intel,nue,maxtorhof,nue1,nue1-srv1
…

[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests

[https://openqa.opensuse.org]
# okurz: 2023-07-21: For now connected to prg2 webUI but still using tests from old as we do not have ssh+rsync connection to new yet, pending firewall change
TESTPOOLSERVER = rsync://openqa1-opensuse/tests

as I think, based on https://openqa.opensuse.org/admin/workers/, we have enough resources.
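
To double-check such coverage one can also query the remaining workers via the API; a sketch assuming openqa-cli and jq are available and that the workers route exposes a status field and the WORKER_CLASS property:

# list the worker classes of all o3 workers that are not dead
openqa-cli api --host https://openqa.opensuse.org workers | jq -r '.workers[] | select(.status != "dead") | .properties.WORKER_CLASS' | tr ',' '\n' | sort -u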

Actions #18

Updated by okurz 7 months ago

  • Related to action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Actions #20

Updated by okurz 7 months ago

  • Due date deleted (2023-10-05)
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

After I announced that I had switched off o3 workers there were worries due to a kernel bug preventing nested virt from working, which fvogt resolved by downgrading the kernel and applying a package lock. Further work here is blocked by #132134 and #132152.
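
For reference, downgrading and locking a kernel package on openSUSE could look like this (a sketch; the exact package and version used are not recorded in this ticket):

# install a known-good older kernel and keep zypper from upgrading it again
sudo zypper install --oldpackage kernel-default-<known-good-version>
sudo zypper addlock kernel-default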

Actions #21

Updated by okurz 7 months ago

Also removed and powered down malbec due to #137270 now.

Actions #22

Updated by okurz 7 months ago

  • Status changed from Blocked to In Progress
Actions #24

Updated by okurz 7 months ago

Actions #25

Updated by okurz 7 months ago

First passed s390x jobs from imagetester observed: https://openqa.suse.de/tests/12358710

With that now removing worker5+10 from production and powering off:

for i in 5 10; do sudo salt "worker$i.oqa.suse.de" cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs) && poweroff' && sudo salt-key -y -d worker$i.oqa.suse.de; done
worker5.oqa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@4.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@5.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@6.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@7.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@8.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@9.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@10.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@11.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@12.service.
The following keys are going to be deleted:
Accepted Keys:
worker5.oqa.suse.de
Key for minion worker5.oqa.suse.de deleted.
worker10.oqa.suse.de:
    Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@4.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@5.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@6.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@7.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@8.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@9.service.
    Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@10.service.
The following keys are going to be deleted:
Accepted Keys:
worker10.oqa.suse.de
Key for minion worker10.oqa.suse.de deleted.
Actions #26

Updated by okurz 7 months ago

  • Related to action #137270: [FIRING:1] host_up (malbec: host up alert openQA malbec host_up_alert_malbec worker) and similar about malbec added
Actions #27

Updated by okurz 7 months ago

  • Related to coordination #131519: [epic] Additional redundancy for OSD virtualization testing added
Actions #30

Updated by okurz 7 months ago

  • Status changed from In Progress to Blocked

With #132134 completed as part of this ticket in preparation of the SUSE DC NUE1 decommissioning, I powered off openqaworker7.openqanet.opensuse.org as well as w19+w20 now.

This leaves xen+hyperv+bare-metal and the like, so blocking on #131519

Actions #31

Updated by okurz 7 months ago

  • Related to action #137384: [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M added
Actions #32

Updated by okurz 7 months ago

  • Status changed from Blocked to In Progress

We will have an opportunity to move machines out of NUE1-SRV1 tomorrow. I have now moved all NUE1-SRV1 based openQA worker instances out of that datacenter room with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/631. All machines in NUE1-SRV1 will now be powered off and prepared for the move to the new datacenter NUE3 "Marienberg DR".

Actions #33

Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved

"AC1: All NUE1 based openQA workers scheduled to move to NUE3 are offline and ready for physical migration" fulfilled, communicated in https://suse.slack.com/archives/C02CANHLANP/p1696421058383399

Actions #35

Updated by okurz 7 months ago

  • Status changed from Resolved to In Progress
Actions #36

Updated by openqa_review 7 months ago

  • Due date set to 2023-10-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #37

Updated by okurz 7 months ago

  • Due date deleted (2023-10-19)
  • Status changed from In Progress to Resolved

Handled more machines; that should be enough. This morning we fully evacuated NUE1-SRV1 as part of #132620 and also prepared all NUE3-targeted machines for the move, so we had to move worker instances again. With that the AC is fulfilled and we are good.
