coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3
Upgrade o3 workers to openSUSE Leap 15.3 size:M
- Need to upgrade workers before EOL of Leap 15.2 and have a consistent environment
- AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no left over .rpm-new files, etc.)
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Use the instructions from above but use
transactional-update shell for transactional update workers
- After upgrade reboot and check everything working as expected, if not rollback, e.g. with
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
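A minimal sketch of how the "clean upgrade" part of AC1 could be checked after the reboot. The helper name `find_rpm_leftovers` is hypothetical, and the suffixes are the ones rpm usually leaves behind (`.rpmnew`, `.rpmsave`, `.rpmorig`), which is an assumption about what the ticket means by "left over .rpm-new files".

```shell
#!/bin/sh
# Hypothetical helper: list leftover rpm config files below a directory.
# Defaults to /etc, where rpm normally places them next to the original file.
find_rpm_leftovers() {
    dir="${1:-/etc}"
    find "$dir" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}
```

On a worker this would be combined with `systemctl --failed`: both the unit list and the output of `find_rpm_leftovers /etc` should be empty before the host counts as cleanly upgraded.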
#7 Updated by mkittler about 1 month ago
- Status changed from Workable to Feedback
Not sure how to upgrade the transactional workers. I cannot directly follow commands on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades. If I try to run them in
transactional-update shell the commands don't work because no root certificates exist in that environment. Just enter
openqaworker1:~ # transactional-update shell and find that
/var/lib/ca-certificates/pem is empty. The problem is also reproducible on
openqaworker4 and possibly all other workers which are transactional servers. Note that
/etc/ssl/certs is a symlink to
#8 Updated by mkittler about 1 month ago
I've been upgrading power8 and everything seemed to work well. However, it didn't come back and I cannot even reach it via
ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.suse.de -U ADMIN -P ADMIN sol activate now.
I had to create an Infra ticket regarding power8: https://sd.suse.com/servicedesk/customer/portal/1/SD-64563
#9 Updated by mkittler about 1 month ago
@ffogt helped recovering
I used the petitboot environment to chroot into the leap install and replaced kernel-kvmsmall with kernel-default
nfs-client was not installed, also probably because of kernel-kvmsmall
I just watched it boot after a power reset, which needs some patience
Apparently the ipmi power commands and sol session worked at some point after all. It now works for me as well. I'll respond in the Infra ticket.
#10 Updated by mkittler about 1 month ago
I've now been upgrading
openqaworker7. Both generally work now. So aarch64, openqaworker1, openqaworker4, imagetester and rebel are still remaining.
Note that there's a failing service on
openqaworker7 but it has been failing in that way at least since
Sep 05 03:46:26 (which is almost as far as the logs go back):
openqaworker7:~ # systemctl status snapper-cleanup.service
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
     Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static)
     Active: failed (Result: exit-code) since Tue 2021-10-26 15:34:21 CEST; 2min 30s ago
TriggeredBy: ● snapper-cleanup.timer
       Docs: man:snapper(8)
             man:snapper-configs(5)
    Process: 23457 ExecStart=/usr/lib/snapper/systemd-helper --cleanup (code=exited, status=1/FAILURE)
   Main PID: 23457 (code=exited, status=1/FAILURE)

Okt 26 15:34:20 openqaworker7 systemd: Started Daily Cleanup of Snapper Snapshots.
Okt 26 15:34:20 openqaworker7 systemd-helper: running cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper: running number cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper: Deleting snapshot failed.
Okt 26 15:34:20 openqaworker7 systemd-helper: number cleanup for 'root' failed.
Okt 26 15:34:20 openqaworker7 systemd-helper: running timeline cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper: running empty-pre-post cleanup for 'root'.
Okt 26 15:34:21 openqaworker7 systemd: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Okt 26 15:34:21 openqaworker7 systemd: snapper-cleanup.service: Failed with result 'exit-code'.
#11 Updated by mkittler about 1 month ago
- Status changed from Feedback to In Progress
As mentioned by andriinikitin the redirection to https can be avoided via
sed -i 's,download.opensuse.org,mirrorcache.opensuse.org,g' /etc/zypp/repos.d/*.repo. That seems to work in the root-certificate-less environment of
transactional-update shell. So I'm upgrading the remaining workers now.
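A minimal sketch of the mirror switch above that keeps a backup of each repo file, in case the change needs to be reverted after the upgrade. The function name `switch_to_mirrorcache` and the `.bak` suffix are assumptions for illustration; the substitution itself is the one quoted in the comment.

```shell
#!/bin/sh
# Hypothetical helper: point all zypper repo files in a directory at
# mirrorcache.opensuse.org, leaving a <file>.bak copy of each original.
switch_to_mirrorcache() {
    repo_dir="${1:-/etc/zypp/repos.d}"
    for f in "$repo_dir"/*.repo; do
        # Skip the unexpanded glob when the directory has no .repo files.
        [ -e "$f" ] || continue
        sed -i.bak 's,download\.opensuse\.org,mirrorcache.opensuse.org,g' "$f"
    done
}
```

Inside `transactional-update shell` this would be called without arguments before running the upgrade commands; passing a scratch directory as the first argument allows a dry run on copies of the repo files.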
#12 Updated by mkittler about 1 month ago
The following units failed on openqaworker1 after a reboot under Leap 15.3:
openqaworker1:~ # systemctl --failed
  UNIT                                          LOAD   ACTIVE SUB    DESCRIPTION
● container-openqaworker1_container_102.service loaded failed failed Podman container-openqaworker1_container_102.service
● container-openqaworker1_container_103.service loaded failed failed Podman container-openqaworker1_container_103.service
I suspect these are leftovers from experimenting with a containerized setup so I disabled them for now.
#15 Updated by favogt about 1 month ago
Until the issue is fixed, the qemu-seabios package should be downgraded on the openQA workers for x86, like this:
zypper in --oldpackage https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp18.104.22.168.noarch.rpm
zypper al https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp22.214.171.124.noarch.rpm
(On transactional systems, use
transactional-update pkg in and
zypper al instead.)
I did this on ow7, but on the transactional systems I copied the bios files and used bind mounts instead, to avoid reboots which set back the test states.
I also noticed that os-autoinst uses
usb-ehci as controller by default while
qemu-xhci has various advantages, and opened a PR to switch to xhci: https://github.com/os-autoinst/os-autoinst/pull/1838. Incidentally, this also appears to work around the bios issue, so the package downgrade could be omitted if qemu-xhci is used.
#16 Updated by dheidler about 1 month ago
actually they were productive.
#18 Updated by favogt about 1 month ago
Due to https://bugzilla.opensuse.org/show_bug.cgi?id=1192126,
qemu-ovmf-x86_64 had to be downgraded to the Leap 15.2 version as well.
And maybe we can merge https://github.com/os-autoinst/os-autoinst/pull/1838 despite missing test coverage?
In theory we could probably revert the downgrade now that it uses XHCI, but staying on the older seabios for a bit longer won't hurt I'd say.
#20 Updated by favogt about 1 month ago
I've just checked two transactional workers and both have 1.14.0_0_g155821a-103.2 installed. Not sure about the bind mount but it looks like the workers are already back to normal anyways.
Looks like there are no zypper locks defined. Either they got deleted somehow or I added them incorrectly without noticing.
- Status changed from Resolved to Feedback
The PR fixes only the seabios issue so I'm downgrading the transactional workers now to cover the ovmf issue as well:
transactional-shell
zypper in --oldpackage https://download.opensuse.org/update/leap/15.2/oss/noarch/qemu-ovmf-x86_64-201911-lp126.96.36.199.noarch.rpm
zypper al qemu-ovmf
exit
reboot
So far I've done this only on openqaworker1 which is currently rebooting.
openqaworker1 is up again and the lock and package are in place. I've also checked the other workers but they all had the package and lock still in place. (Except for aarch64 but I suppose only the x86_64 workers are relevant here. And rebel doesn't have the package installed at all.)
It didn't work because it should have been
zypper al qemu-ovmf-x86_64. So I downgraded the package again. Judging by the job's history the
grub_test module generally works with the downgrade. I'll check tomorrow again whether the downgraded package survived the nightly update.
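A minimal sketch for the follow-up check of whether the lock is still in place after the nightly update. The helper name `lock_listed` is hypothetical, and it assumes that `zypper ll` prints a `|`-separated table with the package name in the second column. The name is compared exactly, which mirrors the problem described above: a lock on `qemu-ovmf` does not count as a lock on `qemu-ovmf-x86_64`.

```shell
#!/bin/sh
# Hypothetical helper: read a zypper lock listing on stdin and succeed only
# if a lock with exactly the given package name is present.
# Assumes a '|'-separated table whose second column holds the name.
lock_listed() {
    awk -F'|' -v pkg="$1" '
        NF >= 2 {
            name = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim column padding
            if (name == pkg) found = 1
        }
        END { exit !found }
    '
}
```

On a worker this could be used as `zypper ll | lock_listed qemu-ovmf-x86_64 && rpm -q qemu-ovmf-x86_64`, which prints the installed package version only when the lock exists.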