action #99189
closed coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3
Upgrade o3 workers to openSUSE Leap 15.3 size:M
0%
Description
Motivation
- Need to upgrade workers before EOL of Leap 15.2 and have a consistent environment
Acceptance criteria
- AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no left over .rpm-new files, etc.)
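One way to check part of AC1 is sketched below. It assumes that "left over .rpm-new files" refers to the files rpm leaves behind with the suffixes .rpmnew, .rpmsave and .rpmorig; the function name find_rpm_leftovers is hypothetical and not part of any existing tooling.

```shell
#!/bin/sh
# List leftover rpm config files below a directory, one path per line.
# Prints nothing if the tree is clean.
# Assumption: the relevant suffixes are .rpmnew, .rpmsave and .rpmorig.
find_rpm_leftovers() {
    # $1: directory to scan, defaults to /etc
    find "${1:-/etc}" -xdev -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Possible use on a worker, as root:
#   find_rpm_leftovers /etc
#   systemctl --failed
```

An empty result from find_rpm_leftovers together with an empty systemctl --failed list would cover the two examples named in AC1; the "etc." in the criterion is not covered by this sketch.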
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Use the instructions from above but use transactional-update shell for transactional-update workers
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover, the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
Updated by okurz about 3 years ago
- Copied from action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done added
Updated by okurz about 3 years ago
- Subject changed from Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done to Upgrade o3 workers to openSUSE Leap 15.3
- Description updated (diff)
- Assignee deleted (livdywan)
- Priority changed from High to Normal
Updated by livdywan about 3 years ago
- Subject changed from Upgrade o3 workers to openSUSE Leap 15.3 to Upgrade o3 workers to openSUSE Leap 15.3 size:M
- Status changed from New to Workable
Updated by ggardet_arm about 3 years ago
oss-cobbler-03 is a new remote aarch64 machine for aarch64 (and aarch32) jobs and is already based on Leap 15.3.
This is an Ampere eMAG machine with 32 cores, 125G of RAM and 480G of SSD storage. Currently, it runs 7 workers with no problem.
Updated by mkittler about 3 years ago
- Status changed from Workable to Feedback
Not sure how to upgrade the transactional workers. I cannot directly follow the commands on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades. If I try to run them in transactional-update shell the commands don't work because no root certificates exist in that environment. Just enter openqaworker1:~ # transactional-update shell and you will find that /var/lib/ca-certificates/pem is empty. The problem is also reproducible on openqaworker4 and possibly on all other workers which are transactional servers. Note that /etc/ssl/certs is a symlink to /var/lib/ca-certificates/pem.
Updated by mkittler about 3 years ago
I've been upgrading power8 and everything seemed to work well. However, it didn't come back and I cannot even reach it via ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.suse.de -U ADMIN -P ADMIN sol activate now.
I had to create an Infra ticket regarding power8: https://sd.suse.com/servicedesk/customer/portal/1/SD-64563
Updated by mkittler about 3 years ago
@ffogt helped recovering power8: https://bugzilla.suse.com/show_bug.cgi?id=1174166
I used the petitboot environment to chroot into the leap install and replaced kernel-kvmsmall with kernel-default
It's up!
nfs-client was not installed, also probably because of kernel-kvmsmall
I just watched it boot after a power reset, which needs some patience
Apparently the ipmi power commands and sol session worked at some point after all. It works for me now as well. I'll respond in the Infra ticket.
Updated by mkittler about 3 years ago
I've now been upgrading power8 and openqaworker7. Both generally work now. So aarch64, openqaworker1, openqaworker4, imagetester and rebel are still remaining.
Note that there's a failing service on openqaworker7 but it has been failing in that way at least since Sep 05 03:46:26 (which is almost as far as the logs go back):
openqaworker7:~ # systemctl status snapper-cleanup.service
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static)
Active: failed (Result: exit-code) since Tue 2021-10-26 15:34:21 CEST; 2min 30s ago
TriggeredBy: ● snapper-cleanup.timer
Docs: man:snapper(8)
man:snapper-configs(5)
Process: 23457 ExecStart=/usr/lib/snapper/systemd-helper --cleanup (code=exited, status=1/FAILURE)
Main PID: 23457 (code=exited, status=1/FAILURE)
Okt 26 15:34:20 openqaworker7 systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running number cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: Deleting snapshot failed.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: number cleanup for 'root' failed.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running timeline cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running empty-pre-post cleanup for 'root'.
Okt 26 15:34:21 openqaworker7 systemd[1]: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Okt 26 15:34:21 openqaworker7 systemd[1]: snapper-cleanup.service: Failed with result 'exit-code'.
Updated by mkittler about 3 years ago
- Status changed from Feedback to In Progress
As mentioned by @andriinikitin the redirection to https can be avoided via sed -i 's,download.opensuse.org,mirrorcache.opensuse.org,g' /etc/zypp/repos.d/*.repo. That seems to work in the root-certificate-less environment of transactional-update shell. So I'm upgrading the remaining workers now.
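The repository rewrite can be wrapped in a small function so that it can first be tried on a copy of the repo directory. This is only a sketch: the function name use_mirrorcache and the optional directory argument are hypothetical, the sed expression is the one quoted in this comment, and sed -i assumes GNU sed as shipped on Leap.

```shell
#!/bin/sh
# Point every *.repo file in a directory at mirrorcache.opensuse.org
# instead of download.opensuse.org.
# The body runs in a subshell "( ... )" so its variables do not leak.
use_mirrorcache() (
    # $1: repo directory, defaults to /etc/zypp/repos.d
    repo_dir="${1:-/etc/zypp/repos.d}"
    for f in "$repo_dir"/*.repo; do
        [ -e "$f" ] || continue   # no *.repo files: the glob stays unexpanded
        sed -i 's,download.opensuse.org,mirrorcache.opensuse.org,g' "$f"
    done
)

# Possible use inside "transactional-update shell":
#   use_mirrorcache
```

Running it against a scratch copy such as use_mirrorcache /tmp/repos.d allows checking the resulting baseurl lines before touching the real configuration.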
Updated by mkittler about 3 years ago
The following units failed on openqaworker1 after a reboot under Leap 15.3:
openqaworker1:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● container-openqaworker1_container_102.service loaded failed failed Podman container-openqaworker1_container_102.service
● container-openqaworker1_container_103.service loaded failed failed Podman container-openqaworker1_container_103.service
I suspect these are leftovers from experimenting with a containerized setup so I disabled them for now.
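For comparing several workers it can help to reduce the systemctl --failed table to bare unit names. The filter below is a sketch under the assumption that failed units show "loaded failed" in the LOAD and ACTIVE columns, as in the output above; the name failed_units is hypothetical.

```shell
#!/bin/sh
# Print the names of failed units, one per line.
# Reads "systemctl --failed" style output on stdin, with or without the
# leading "●" marker, and ignores the header and footer lines.
failed_units() {
    awk '{
        # The unit name is the field directly before "loaded failed".
        for (i = 1; i + 2 <= NF; i++) {
            if ($(i + 1) == "loaded" && $(i + 2) == "failed") {
                print $i
                break
            }
        }
    }'
}

# Possible use:
#   systemctl --failed | failed_units
```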
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
All workers (openqaworker1 openqaworker4 openqaworker7 power8 rebel imagetester aarch64) are now running on Leap 15.3. Besides the (unimportant) services mentioned in the comments it looks good so far.
Updated by okurz about 3 years ago
A whole lot of openSUSE Tumbleweed tests have been failing after the upgrade. Dimstar and fvogt narrowed down the problem to a qemu-seabios update. It was reported in https://bugzilla.suse.com/show_bug.cgi?id=1192115
Updated by favogt about 3 years ago
Until the issue is fixed, the qemu-seabios package should be downgraded on the openQA workers for x86, like this:
zypper in --oldpackage https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp152.9.20.1.noarch.rpm
zypper al https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp152.9.20.1.noarch.rpm
(resp. with transactional-update pkg in and zypper al instead)
I did this on ow7, but on the transactional systems I copied the bios files and used bind mounts instead, to avoid reboots which set back the test states.
I also noticed that os-autoinst uses usb-ehci as controller by default while qemu-xhci has various advantages, and opened a PR to switch to xhci: https://github.com/os-autoinst/os-autoinst/pull/1838. Incidentally, this also appears to work around the bios issue, so the package downgrade could be omitted if qemu-xhci is used.
Updated by dheidler about 3 years ago
See https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers and https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers
actually they were productive.
Updated by mkittler about 3 years ago
I'm aware of the issues and also noted it in #101265#note-9.
Is there anything left to do for o3 workers (since @favogt already took care of them)? And maybe we can merge https://github.com/os-autoinst/os-autoinst/pull/1838 despite missing test coverage?
Updated by favogt about 3 years ago
Due to https://bugzilla.opensuse.org/show_bug.cgi?id=1192126, qemu-ovmf-x86_64 had to be downgraded to the Leap 15.2 version as well.
And maybe we can merge https://github.com/os-autoinst/os-autoinst/pull/1838 despite missing test coverage?
In theory we could probably revert the downgrade now that it uses XHCI, but staying on the older seabios for a bit longer won't hurt I'd say.
Updated by mkittler about 3 years ago
I've just checked two transactional workers and both have 1.14.0_0_g155821a-103.2 installed. Not sure about the bind mount but it looks like the workers are already back to normal anyways.
Updated by favogt about 3 years ago
mkittler wrote:
I've just checked two transactional workers and both have 1.14.0_0_g155821a-103.2 installed. Not sure about the bind mount but it looks like the workers are already back to normal anyways.
Looks like there are no zypper locks defined. Either they got deleted somehow or I added them incorrectly without noticing.
Updated by mkittler about 3 years ago
Then I'll leave the systems as they are except for the non-transactional worker openqaworker7 where I removed the lock.
I suppose this ticket can be considered resolved - or were there any other problems?
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
Looks like nothing else came up so I'm closing this issue.
Updated by mkittler about 3 years ago
- Status changed from Resolved to Feedback
The PR fixes only the seabios issue so I'm downgrading the transactional workers now to cover the ovmf issue as well:
transactional-update shell
zypper in --oldpackage https://download.opensuse.org/update/leap/15.2/oss/noarch/qemu-ovmf-x86_64-201911-lp152.6.17.1.noarch.rpm
zypper al qemu-ovmf
exit
reboot
So far I've done this only on openqaworker1 which is currently rebooting.
Updated by mkittler about 3 years ago
openqaworker1 is up again and the lock and package are in place. I've also checked the other workers but they all had the package and lock still in place. (Except for aarch64 but I suppose only the x86_64 workers are relevant here. And rebel doesn't have the package installed at all.)
Updated by mkittler about 3 years ago
It didn't work because it should have been zypper al qemu-ovmf-x86_64. So I downgraded the package again. Judging by the job's history the grub_test module generally works with the downgrade. I'll check again tomorrow whether the downgraded package survived the nightly update.
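Because a lock only helps if its name matches the installed package exactly, a quick check of the lock list could look like the sketch below. It assumes that zypper locks prints a "|"-separated table with the package name in the second column; the function name has_lock is hypothetical.

```shell
#!/bin/sh
# Succeed if a lock exists for exactly the given package name.
# Reads "zypper locks" table output on stdin.
# Assumption: the package name is the second "|"-separated column.
has_lock() {
    awk -F'|' -v name="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $2) }   # trim padding around the name
        $2 == name { found = 1 }
        END { exit !found }
    '
}

# Possible use:
#   zypper locks | has_lock qemu-ovmf-x86_64 && echo "lock in place"
```

With only a lock named qemu-ovmf in the list, has_lock qemu-ovmf-x86_64 fails, which matches the mistake described above.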
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
The downgrade survived the nightly transactional update. Also further test runs (conducted on different workers) worked.
Updated by okurz over 2 years ago
- Copied to action #111863: Upgrade o3 workers to openSUSE Leap 15.4 size:M added