openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4
Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4
- Need to upgrade workers before EOL of Leap 15.3 and have a consistent environment
- AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.4 (no failed systemd services, no left over .rpm-new files, etc.)
- AC2: openqa-monitor runs openSUSE Leap 15.4
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover from, there is the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
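As a rough sketch of what a snapshot-based recovery could look like (snapper usage is an assumption based on default openSUSE btrfs setups, not the documented osd procedure, and the snapshot number is a placeholder):

```shell
# Hedged sketch: rolling back the root filesystem to a pre-upgrade btrfs
# snapshot with snapper. The snapshot number is a placeholder; check the
# real list first.
if command -v snapper >/dev/null 2>&1; then
    snapper list || true          # identify the snapshot taken before "zypper dup"
    # snapper rollback <number>   # then reboot into the rolled-back system
else
    echo "snapper not installed; commands above are illustrative only"
fi
```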
- Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4 size:M
- Description updated (diff)
- Assignee deleted
- Priority changed from High to Normal
- Target version changed from Ready to future
- Due date set to 2022-08-04
Upgrading openqaworker11 and openqaworker12 manually as they are currently not in salt. openqaworker12 is still on Leap 15.2 so I will do a direct Leap 15.2->15.4 upgrade, for fun and because I am curious what happens.
Many repositories have changed their URL format from "openSUSE_Leap_$releasever" to just "$releasever", so we need to adapt to that.
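The required rewrite can be sanity-checked in isolation; the repo baseurl below is a made-up example, and the sed expression is the same one applied to /etc/zypp/repos.d/* in the dup command further down:

```shell
# Made-up sample baseurl using the old path scheme:
old='http://download.opensuse.org/repositories/devel:openQA/openSUSE_Leap_15.3/'
# Drop the "openSUSE_Leap_" path prefix, keeping only the release number:
new=$(printf '%s\n' "$old" | sed -e 's@/openSUSE_Leap_@/@g')
printf '%s\n' "$new"   # -> http://download.opensuse.org/repositories/devel:openQA/15.3/
```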
Actually, by now there are more machines controlled by our salt structure, so let's upgrade them along the way.
powerqaworker-qam-1.qa.suse.de has some conflicts about python2 packages, manually removed first.
sudo salt --no-color --state-output=changes -C 'powerqaworker-qam-1.qa.suse.de' cmd.run 'zypper -n rm -u python2-libxml2-python'
then on all the machines:
sudo salt --no-color --state-output=changes -C 'not G@roles:webui' cmd.run '(rpm -q qemu-ovmf-x86_64 && zypper al qemu-ovmf-x86_64) ; zypper rr telegraf-monitoring && sed -i -e "s@/openSUSE_Leap_@/@g" /etc/zypp/repos.d/* && zypper -n --releasever=15.4 ref && zypper -n --releasever=15.4 dup --auto-agree-with-licenses --replacefiles --download-in-advance'
seems to have gone fine
$ salt \* grains.get oscodename
storage.qa.suse.de: openSUSE Leap 15.4
openqaworker2.suse.de: openSUSE Leap 15.4
openqaworker3.suse.de: openSUSE Leap 15.4
openqaworker9.suse.de: openSUSE Leap 15.4
openqaworker6.suse.de: openSUSE Leap 15.4
QA-Power8-5-kvm.qa.suse.de: openSUSE Leap 15.4
openqaworker5.suse.de: openSUSE Leap 15.4
openqaworker14.qa.suse.cz: openSUSE Leap 15.4
powerqaworker-qam-1.qa.suse.de: openSUSE Leap 15.4
openqa-monitor.qa.suse.de: openSUSE Leap 15.4
QA-Power8-4-kvm.qa.suse.de: openSUSE Leap 15.4
openqaworker13.suse.de: openSUSE Leap 15.4
grenache-1.qa.suse.de: openSUSE Leap 15.4
openqa.suse.de: openSUSE Leap 15.4
openqaworker-arm-2.suse.de: openSUSE Leap 15.4
openqaworker8.suse.de: openSUSE Leap 15.4
backup.qa.suse.de: openSUSE Leap 15.4
openqaworker10.suse.de: openSUSE Leap 15.4
openqaworker-arm-1.suse.de: openSUSE Leap 15.4
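Over output like the above, stragglers can be spotted quickly; a sketch over made-up sample data (one host deliberately still on 15.3):

```shell
# Made-up sample of "salt \* grains.get oscodename" output, one host per line:
salt_out='openqaworker2.suse.de: openSUSE Leap 15.4
openqaworker3.suse.de: openSUSE Leap 15.3'
# Keep only the hosts that do not yet report 15.4:
stragglers=$(printf '%s\n' "$salt_out" | grep -v 'openSUSE Leap 15.4' || true)
printf '%s\n' "$stragglers"   # -> openqaworker3.suse.de: openSUSE Leap 15.3
```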
except for openqaworker-arm-3, which repeatedly crashed. Need to try harder. Triggered a reboot for most workers now.
openqaworker8, openqaworker9 and openqaworker14 have not come back up yet; openqaworker8 and 9 at least dropped into "maintenance" mode with a failed "openqa_nvme_format". The actual failing command is
mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=1 --run /dev/nvme0n1
which reports "mdadm: cannot open /dev/nvme0n1: Device or resource busy". That is understandable: on openqaworker8+9 nvme0n1 has three partitions and also holds the root filesystem, hence it is already "busy". Only nvme0n1p3 should be used here.
The reason seems to be this:
├─nvme0n1p2 259:2 0 100G 0 part /var/tmp
│                               /var/spool
│                               /var/opt
│                               /var/log
│                               /var/lib/pgsql
│                               /var/lib/named
│                               /var/lib/mysql
│                               /var/lib/mariadb
│                               /var/lib/mailman
│                               /var/lib/libvirt/images
│                               /var/lib/machines
│                               /var/crash
│                               /var/cache
│                               /usr/local
│                               /tmp
│                               /srv
│                               /opt
│                               /boot/grub2/x86_64-efi
│                               /boot/grub2/i386-pc
│                               /.snapshots
│                               /
└─nvme0n1p3 259:3
bash -ex /usr/local/bin/openqa-establish-nvme-setup
# lsblk --noheadings | grep -v nvme | grep "/$"
│                               /
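The misbehaving check can be reproduced standalone; the lsblk output below is a simplified stand-in for the real one, and the grep pipeline is the one from openqa-establish-nvme-setup quoted above:

```shell
# Simplified stand-in for current lsblk output: newer lsblk prints every
# mountpoint of a btrfs filesystem on its own continuation line.
lsblk_out='nvme0n1     259:0 0   1T 0 disk
├─nvme0n1p2 259:2 0 100G 0 part /var/log
│                               /
└─nvme0n1p3 259:3 0 800G 0 part'
# The old "is / on a non-NVMe device?" check from the setup script:
match=$(printf '%s\n' "$lsblk_out" | grep -v nvme | grep "/$" || true)
printf '%s\n' "$match"
# The continuation line contains no "nvme" and ends in "/", so it matches
# even though the root filesystem actually lives on nvme0n1p2.
```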
maybe lsblk has changed its output format. Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/714, patched openqaworker8+9 manually.
Reported #114493; this is also related to #111992 but on aarch64. Trying to install the old package from http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/noarch/?P=*qemu-uefi*
sudo zypper -n in --oldpackage http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/noarch/qemu-uefi-aarch64-202008-10.8.1.noarch.rpm && sudo zypper al qemu-uefi-aarch64
That helped. openqaworker14 was suffering from the same "lsblk" parse problem as the other machines. qa-power8-4 and qa-power8-5 might still be problematic.