action #134912
closed - Gradually phase out NUE1 based openQA workers size:M
Description
Motivation
PRG2 workers, along with NUE2 and PRG1 based openQA workers, should take over all the work for OSD. To ensure all worker classes stay covered we should gradually phase out NUE1 workers, e.g. disable worker instances or power off machines one by one, and closely check scheduled tests and https://monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1 to ensure that every worker class still has available resources to work on.
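A minimal sketch of such a check, assuming jq is available and that the /api/v1/jobs route exposes job settings (both assumptions, not verified here): count scheduled jobs per WORKER_CLASS to spot classes that lost all their workers.
# count scheduled jobs per WORKER_CLASS on OSD
curl -s "https://openqa.suse.de/api/v1/jobs?state=scheduled" \
  | jq -r '.jobs[].settings.WORKER_CLASS // empty' \
  | sort | uniq -c | sort -rn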
Acceptance criteria
- AC1: All NUE1 based openQA workers scheduled to move to NUE3 are offline and ready for physical migration
Suggestions
- Look up machines from https://netbox.suse.de/dcim/devices/?tag=move-to-marienberg-dr&tag=qe-lsg&site_id=46
- Check https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls for worker classes that are only available in NUE1 and make sure they are also covered in PRG2
- Disable and switch off openQA worker instances one by one, or simply power off complete machines, and monitor the impact (see the sketch after this list)
- Fix issues based on jobs not being scheduled or – in the worst case – based on user reports
- Communicate plans/heads-up in eng-testing
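A hedged sketch of taking instances out one by one on a salt-managed host; worker3 and instance 1 are only hypothetical examples, the unit name follows the openqa-worker-auto-restart@N scheme used on OSD workers:
# drain a single worker instance while leaving the others running
sudo salt 'worker3.oqa.suse.de' cmd.run 'systemctl disable --now openqa-worker-auto-restart@1'
# once the job-age dashboard confirms nothing starves, repeat for the remaining
# instances and finally power off the whole machine
sudo salt 'worker3.oqa.suse.de' cmd.run 'poweroff'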
Updated by okurz over 1 year ago
- Copied from action #132137: Setup new PRG2 openQA worker for osd size:M added
Updated by livdywan over 1 year ago
- Subject changed from Gradually phase out NUE1 based openQA workers to Gradually phase out NUE1 based openQA workers size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
I already disabled and switched off worker13 and worker9 for good now, see #135122-18
Updated by okurz over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz over 1 year ago
- Related to action #134948: Ensure IPv6 is working in the OSD setup (since we have workers in PRG2 and the VM has been migrated) size:M added
Updated by okurz over 1 year ago
- Related to action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M added
Updated by okurz over 1 year ago
- Status changed from In Progress to Blocked
- Priority changed from Urgent to High
I added more commits to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/599 . However, as there are some special worker classes, I suggest we only merge after IPv6 and reverse DNS PTR resolution are working, so blocking on #134948 and #134879
Updated by okurz over 1 year ago
- Related to action #134132: Bare-metal control openQA worker in NUE2 size:M added
Updated by okurz about 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/599 merged, still waiting for #134948
Updated by okurz about 1 year ago
- Status changed from Blocked to In Progress
Continuing here due to #123800-14
Updated by okurz about 1 year ago
#134948 resolved. IPv6 is expected to be fully up and usable, at least from the OSD VM itself, though not necessarily for all workers.
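A quick way to double-check that from the OSD VM itself, as a sketch (openqa.suse.de is only an example target and is assumed to be reachable over IPv6; any dual-stack host works):
# force IPv6 for both ICMP and HTTPS from the OSD VM
ssh osd 'ping -6 -c 1 openqa.suse.de && curl -6 -sI https://openqa.suse.de | head -n 1'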
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
I am going over the list of workers as defined in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls from top to bottom and considering which we can disable and turn off.
worker2 is still needed for NUE1 based s390x, see #132152, continuing with worker3.
ssh osd "sudo salt 'worker3.oqa.suse.de' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs) && poweroff' && sudo salt-key -y -d worker3.oqa.suse.de"
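For readability, a sketch of the same steps as they would be run directly on OSD, avoiding the extra ssh quoting layer (when wrapped in ssh "..." as above, the $(...) would need escaping as \$(...) so it expands on the minion rather than locally):
# disable telegraf and all openQA worker instances on the minion, then power it off
sudo salt 'worker3.oqa.suse.de' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs) && poweroff'
# remove the minion key so salt no longer expects the host to be reachable
sudo salt-key -y -d worker3.oqa.suse.de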
worker5 is also used for s390x, so continuing with worker8: disabled and powered off.
worker9 was online but not controlled by salt (@Liv Dywan), now "properly" disabled and powered off as well.
Updated by okurz about 1 year ago
I reviewed o3 NUE1 workers. For now I shut down w4 with the worker class config
WORKER_CLASS = openqaworker4,x86_64-v3,qemu_x86_64,pool_is_hdd,heavyload,nue,nue1,nue1-srv1
…
[7]
WORKER_CLASS = openqaworker4,qemu_x86_64,qemu_i586,pool_is_hdd,heavyload
[8]
WORKER_CLASS = openqaworker4,qemu_x86_64,qemu_i586,pool_is_hdd,heavyload
and
w6 with
WORKER_CLASS = openqaworker6,x86_64-v3,qemu_x86_64,qemu_x86_64_staging,platform_intel,nue,maxtorhof,nue1,nue1-srv1
…
[http://openqa1-opensuse]
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
[https://openqa.opensuse.org]
# okurz: 2023-07-21: For now connected to prg2 webUI but still using tests from old as we do not have ssh+rsync connection to new yet, pending firewall change
TESTPOOLSERVER = rsync://openqa1-opensuse/tests
as I think, based on https://openqa.opensuse.org/admin/workers/, we have enough resources
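A sketch of cross-checking that from the API rather than the admin page, assuming the /api/v1/workers route exposes WORKER_CLASS in each worker's properties (field names may differ between openQA versions):
# count registered o3 worker slots per worker class
curl -s "https://openqa.opensuse.org/api/v1/workers" \
  | jq -r '.workers[].properties.WORKER_CLASS // empty' \
  | tr ',' '\n' | sort | uniq -c | sort -rn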
Updated by okurz about 1 year ago
- Related to action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Updated by okurz about 1 year ago
- Due date deleted (2023-10-05)
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
After I announced that I had switched off o3 workers there were worries due to a kernel bug preventing nested virt from working, which fvogt resolved by downgrading the kernel and applying a package lock. Further work here is blocked by #132134 and #132152
Updated by okurz about 1 year ago
Also removed and powered down malbec due to #137270 now.
Updated by okurz about 1 year ago
Moving all s390x controlling instances out of NUE1 with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/625/diffs?commit_id=855886d309049d03b81a89e936401d33ede7ec08
Updated by okurz about 1 year ago
- Related to openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 added
Updated by okurz about 1 year ago
First passed s390x jobs observed from imagetester: https://openqa.suse.de/tests/12358710
With that I am now removing worker5+10 from production and powering them off:
for i in 5 10; do sudo salt "worker$i.oqa.suse.de" cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs) && poweroff' && sudo salt-key -y -d worker$i.oqa.suse.de; done
worker5.oqa.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@4.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@5.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@6.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@7.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@8.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@9.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@10.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@11.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@12.service.
The following keys are going to be deleted:
Accepted Keys:
worker5.oqa.suse.de
Key for minion worker5.oqa.suse.de deleted.
worker10.oqa.suse.de:
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@1.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@2.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@3.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@4.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@5.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@6.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@7.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@8.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@9.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker-auto-restart@10.service.
The following keys are going to be deleted:
Accepted Keys:
worker10.oqa.suse.de
Key for minion worker10.oqa.suse.de deleted.
Updated by okurz about 1 year ago
- Related to action #137270: [FIRING:1] host_up (malbec: host up alert openQA malbec host_up_alert_malbec worker) and similar about malbec added
Updated by okurz about 1 year ago
- Related to coordination #131519: [epic] Additional redundancy for OSD virtualization testing added
Updated by okurz about 1 year ago
- Status changed from In Progress to Blocked
With #132134 completed, as part of this ticket and in preparation of the SUSE DC NUE1 decommissioning, I powered off openqaworker7.openqanet.opensuse.org as well as w19+w20 now.
This leaves xen+hyperv+bare-metal and the like, so blocking on #131519
Updated by okurz about 1 year ago
- Related to action #137384: [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M added
Updated by okurz about 1 year ago
- Status changed from Blocked to In Progress
We will have an opportunity to move machines out of NUE1-SRV1 tomorrow. I have now moved all NUE1-SRV1 based openQA worker instances out of that datacenter room with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/631 . All machines in NUE1-SRV1 will now be powered off and prepared for the move to the new datacenter NUE3 "Marienberg DR".
Updated by okurz about 1 year ago
- Status changed from In Progress to Resolved
"AC1: All NUE1 based openQA workers scheduled to move to NUE3 are offline and ready for physical migration" fulfilled, communicated in https://suse.slack.com/archives/C02CANHLANP/p1696421058383399
Updated by okurz about 1 year ago
- Status changed from Resolved to In Progress
I forgot about machines that are in NUE1-SRV1 but are not going to NUE3, as listed in https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-prague-colo or https://netbox.suse.de/dcim/devices/?tag=qe-lsg&tag=move-to-prague-colo2
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
- Due date deleted (2023-10-19)
- Status changed from In Progress to Resolved
Handled more machines. That should be enough. This morning we fully evacuated NUE1-SRV1 as part of #132620 and also prepared all NUE3 targeted machines for the move, so we had to move worker instances again. With that the AC is fulfilled and we are good.