action #157975
opencoordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
Upgrade osd workers to openSUSE Leap 15.6
0%
Description
Motivation
- Need to upgrade workers before EOL of Leap 15.5 and have a consistent environment
Acceptance criteria
- AC1: all osd worker machines run a cleanly upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)
Acceptance tests
- AT1-1:
sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping
is empty
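The salt query in AT1-1 only covers the reported release. For the remaining parts of AC1, a per-host check could be sketched roughly as follows. The function names are hypothetical and not part of any existing tooling; `.rpmnew` and `.rpmsave` are the suffixes RPM uses for unmerged config files.

```shell
#!/bin/bash
# Hypothetical helpers for checking AC1 on a single host.

# Print VERSION_ID from an os-release style file (default: /etc/os-release).
os_version_id() {
    local file="${1:-/etc/os-release}"
    # Source inside a subshell so the variables do not leak into the caller.
    ( . "$file" && printf '%s\n' "$VERSION_ID" )
}

# List config files that RPM left unmerged below a directory (default: /etc).
leftover_config_files() {
    local dir="${1:-/etc}"
    find "$dir" -type f \( -name '*.rpmnew' -o -name '*.rpmsave' \) | sort
}

# Print failed systemd units, one per line (empty output means none failed).
failed_units() {
    systemctl --failed --no-legend --plain
}
```

On a worker, `os_version_id` should print `15.6`, and both `leftover_config_files` and `failed_units` should print nothing once the host is in the state AC1 asks for.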
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with
snapper rollback
Rollback steps
hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from, there is IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
Updated by okurz 4 months ago
- Copied from action #130588: Upgrade osd workers to openSUSE Leap 15.5 added
Updated by okurz about 1 month ago
Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239
Updated by okurz about 1 month ago
- Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Updated by okurz about 1 month ago
- Assignee set to okurz
- Target version changed from Tools - Next to Ready
Updated by okurz about 1 month ago
- Assignee deleted (okurz)
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 month ago
- Status changed from New to In Progress
- Assignee set to okurz
- Target version changed from Tools - Next to Ready
okurz wrote in #note-4:
Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239
I caused this problem with #132137-6: since then the installer also uses openSUSE-current in the deployed repositories, but that is incompatible with the update repos, which are either hardcoded to a version like 15.5 or use $releasever.
So I called
sudo salt \* cmd.run 'sed -i -e "s@openSUSE-current@leap/\$releasever@" /etc/zypp/repos.d/*'
to correct this. Now we should push through with the upgrade to Leap 15.6 to be consistent.
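To illustrate what the sed call above changes, here is a minimal sketch wrapped in a function so that it can be tried on a copy of the repo files first. The function name is hypothetical, GNU sed is assumed for `-i`, and unlike the original call it only touches `*.repo` files.

```shell
#!/bin/bash
# Hypothetical helper: rewrite repo files below a directory in place
# (default: /etc/zypp/repos.d) using the same substitution as above.
pin_repos_to_releasever() {
    local dir="${1:-/etc/zypp/repos.d}"
    # Single quotes keep $releasever literal so that zypper expands it later.
    sed -i -e 's@openSUSE-current@leap/$releasever@' "$dir"/*.repo
}
```

A base URL such as http://download.opensuse.org/distribution/openSUSE-current/repo/oss then becomes http://download.opensuse.org/distribution/leap/$releasever/repo/oss, which follows the installed release instead of the newest one. Running `zypper lr -u` afterwards shows the resulting URLs.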
Updated by okurz about 1 month ago
- Copied to action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work added
Updated by okurz about 1 month ago · Edited
- Description updated (diff)
I started with w31, where I triggered a reboot. After the reboot it took 20m(!) for the machine to be fully reachable over the network, which caused issues because os-autoinst-openvswitch would time out, leading to incomplete jobs like https://openqa.suse.de/tests/14611797 . Triggered another reboot and then
for i in WORKER="worker31"; do host=openqa.suse.de failed_since=2024-06-14 comment="label:poo157975" ./openqa-advanced-retrigger-jobs; done
I disabled the openQA worker instances on w31 for now:
sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)
and aborted reboot requests and masked rebootmgr for now:
ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'rebootmgrctl cancel && systemctl mask --now rebootmgr'"
and added the corresponding rollback steps.
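The pipeline used above to collect the worker instances cuts each `systemctl list-units` line at the first dot, which can pick up the status marker that systemctl prints in front of failed units. A slightly more defensive variant could look like this sketch; the function name is hypothetical and the unit naming `openqa-worker-auto-restart@N.service` is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: read `systemctl list-units` style lines on stdin and
# print the full names of the openqa-worker-auto-restart units, one per line.
worker_units() {
    awk '{
        # Check every field so that a leading status marker does not matter.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^openqa-worker-auto-restart@.*\.service$/) print $i
    }'
}

# Possible usage on a worker (xargs -r is a GNU extension):
#   systemctl list-units --all --plain --no-legend | worker_units \
#       | xargs -r sudo systemctl disable --now
```

The original command also disabled telegraf in the same call; that unit would have to be added explicitly here.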
We found that the system is not able to bring up the network completely and also crashes with a kernel panic, so I am reverting to an older kernel version. systemctl status wicked* showed that wicked-nanny is running into dbus-related timeouts.
Booting into older kernel:
# grep 'submenu\|menuentry\>' /boot/grub2/grub.cfg
menuentry "Help on bootable snapshot #$snapshot_num" {
menuentry 'openSUSE Leap 15.6' --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
submenu 'Advanced options for openSUSE Leap 15.6' --hotkey=1 $menuentry_id_option 'gnulinux-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default' --hotkey=2 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default (recovery mode)' --hotkey=3 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default' --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default (recovery mode)' --hotkey=1 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
worker31:~ # grub2-reboot '1>3'
Updated by okurz about 1 month ago
- Copied to action #162293: SMART errors on bootup of w31+w32, possibly more added
Updated by okurz about 1 month ago
That was not successful: there was no output after the initial kernel loading. I booted into 6.4 again and then called snapper rollback $id on the $id from 2024-06-12, before the upgrade to 15.6 happened. Rebooted and the system came up just fine again but struggles with SMART errors. For this I extracted a worker31-specific ticket into #162293. Crosschecked with w32 and it behaves the same. So I assume it's a generic problem which affects at least all our happyware servers w29-w40 the same way, possibly o3 workers w21-w28 as well. Now doing the corresponding snapper rollback on all and patching the repo config as in #157975-8
Updated by okurz about 1 month ago
Downgraded w29, w30, w31, w32, w33, w34, w35, w40, warm1, warm2. w36-39 are offline as they are currently not used. And applied
sed -i -e "s@openSUSE-current@leap/\$releasever@" -e "s@15\.5/@\$releasever/@g" /etc/zypp/repos.d/* ; zypper ref && zypper -n dup --dry-run
on all. Also checked systemctl status and alerts. It seems we are back to status "green" again for now. Also, all systems currently under salt control are reachable. I recommend we try the upgrade again, first with another machine, e.g. a NUE2 or PRG1 based one. openqaworker14 seems to be a good candidate as it runs only worker classes for which we have redundancy, and only qemu. We can also consider w36-w40, related to #139103
Updated by okurz about 1 month ago
- Related to action #162260: auto-update.service fails on various workers due to a package conflict added
Updated by okurz 9 days ago
- Copied to action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 added