action #73189
coordination #69478 (closed): [epic] Upgrade o3+osd workers+webui to openSUSE Leap 15.2
Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done
Added by okurz about 4 years ago. Updated about 4 years ago.
0%
Description
Motivation
- Need to upgrade workers before EOL of Leap 15.1 and have a consistent environment
Acceptance criteria
- AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.2 (no failed systemd services, no left over .rpm-new files, etc.)
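A rough way to check AC1 on a worker could look like the sketch below. It assumes that leftover RPM config files carry the suffixes .rpmnew, .rpmsave or .rpmorig (the ticket writes ".rpm-new"), and the helper name list_leftover_rpm_configs is made up for illustration. The demo only touches a scratch directory.

```shell
#!/bin/sh
# Sketch of an AC1 check; list_leftover_rpm_configs is a hypothetical helper name.

# List leftover RPM config files (*.rpmnew, *.rpmsave, *.rpmorig) below a directory.
list_leftover_rpm_configs() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Demo on a scratch directory so nothing on the host is touched.
scratch=$(mktemp -d)
touch "$scratch/ntp.conf" "$scratch/ntp.conf.rpmnew"
list_leftover_rpm_configs "$scratch"
rm -rf "$scratch"

# On a real worker one would additionally run (assumes systemd):
#   systemctl --failed --no-legend
#   list_leftover_rpm_configs /etc
```

An empty result from both commands would be a reasonable indication that the criterion is met.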
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Use the instructions from above but use transactional-update shell for transactional update workers
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover from and the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
Updated by okurz about 4 years ago
- Copied from action #72079: Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed" added
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by okurz about 4 years ago
be aware of the problems encountered in #72079 though. So upgrading other machines can help to crosscheck if problems reproduce there but this can also increase the impact for more users.
Updated by livdywan about 4 years ago
okurz wrote:
be aware of the problems encountered in #72079 though. So upgrading other machines can help to crosscheck if problems reproduce there but this can also increase the impact for more users.
Aye, I went through the docs and the steps of the upgrade but didn't activate the snapshot for now.
Two error messages stood out, and I was wondering if you saw these with openqa-aarch64:
- /usr/sbin/grub2-install: warning: Couldn't find physical volume (null)
- dracut module 'biosdevname' will not be installed, because command 'biosdevname' could not be found
I couldn't find anything in the grub configuration. Of course the missing module could simply be installed - I'm mostly wondering if this is pointing to an existing misconfiguration.
Updated by okurz about 4 years ago
Haven't seen these messages but I wouldn't worry. You can just make sure you have access over IPMI for a quick recovery if needed and try the reboot.
Updated by okurz about 4 years ago
@cdywan I assume you did not yet upgrade any further workers. Do you still plan?
Updated by livdywan about 4 years ago
- Status changed from In Progress to Workable
okurz wrote:
@cdywan I assume you did not yet upgrade any further workers. Do you still plan?
I was going to mark it as blocked by #72079, but Redmine distracted me with a confusing error and by trying to make it block the other ticket instead.
Updated by okurz about 4 years ago
All the problems in #72079 I consider likely to be aarch64 specifics so you can go ahead with the upgrade.
Updated by okurz about 4 years ago
Little hint: To try out upgrading in a safe environment you can run an openSUSE Leap 15.1 container, e.g. podman run --rm -it registry.opensuse.org/opensuse/leap:15.1
and run upgrade with https://github.com/okurz/auto-upgrade/blob/master/auto-upgrade
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
okurz wrote:
Little hint: To try out upgrading in a safe environment you can run an openSUSE Leap 15.1 container, e.g.
podman run --rm -it registry.opensuse.org/opensuse/leap:15.1
and run upgrade with https://github.com/okurz/auto-upgrade/blob/master/auto-upgrade
I tried out the script. I like the idea. Unfortunately the use of screen consistently fails here with two different terminals. I will open a GitHub issue.
For now I'm proceeding with this one-liner (inside transactional-update --no-selfupdate shell):
old=15.1; new=15.2; zypper -n up --auto-agree-with-licenses --replacefiles && sed -i -e "s/$old/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new -n ref && zypper --releasever=$new -n dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck
On imagetester I ran into this:
Download (curl) error for 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.1/noarch/os-autoinst-distri-opensuse-deps-1.1603448722.02a909610-lp151.6188.1.noarch.rpm':
Error code: Curl error 60
Error message: SSL certificate problem: self signed certificate in certificate chain
So I tried openqaworker1 and also ran into (certificate) errors:
Repository 'Providing openQA dependencies (openSUSE_Leap_15.2)' is invalid.
[devel_openQA|http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.2/] Valid metadata not found at specified URL
I thought maybe it's the slightly different steps, so I used the ones from the wiki which I knew to have worked before (inside transactional-update --no-selfupdate shell):
new_version=15.2; . /etc/os-release && sed -i -e "s/${VERSION_ID}/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new_version ref && zypper -n --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck
I guess I need to find out now what's upsetting zypper, and even affecting transactional-update's ability to check for new versions.
Updated by okurz about 4 years ago
You need to run zypper ref outside the transactional-update shell.
I did this with host "rebel" now:
sed -i -e 's@15\.1@\$releasever@g' -e 's@https:@http:@g' /etc/zypp/repos.d/*
zypper --releasever=15.2 -n ref
transactional-update --no-selfupdate shell
zypper --releasever=15.2 --no-refresh --no-gpg-check -n dup --auto-agree-with-licenses --replacefiles --download-in-advance
rpmconfigcheck
for i in $(cat /var/adm/rpmconfigcheck) ; do vimdiff ${i%.rpm*} $i ; done
rm $(cat /var/adm/rpmconfigcheck)
exit
reboot
What previously failed was either the repository refresh or the download of packages, due to certificate mismatches.
It then worked after changing all https repos to http and calling zypper ref outside the transactional shell.
EDIT: 2020-10-23 16:51 Fixed sed expression with second '-e'
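The repository rewrite from the commands above can be tried on a scratch file before touching /etc/zypp/repos.d/*. This is only a sketch: rewrite_repo_file is a hypothetical helper name, the repo file content is made up for illustration, and GNU sed is assumed for -i without a backup suffix.

```shell
#!/bin/sh
# Sketch only: rewrite_repo_file is a hypothetical helper name, and the repo
# file content below is made up for illustration.

# Apply the two substitutions from the comment above to a single file.
# The escaped dot matches only a literal "15.1", and the single quotes keep
# the shell from expanding $releasever. Requires GNU sed for "-i".
rewrite_repo_file() {
    sed -i -e 's@15\.1@\$releasever@g' -e 's@https:@http:@g' "$1"
}

repo=$(mktemp)
cat > "$repo" <<'EOF'
[devel_openQA]
baseurl=https://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.1/
EOF
rewrite_repo_file "$repo"
cat "$repo"
rm -f "$repo"
```

Escaping the dot is the safer variant: an unescaped 15.1 would also match strings such as 1511, because the dot then stands for any character.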
Updated by livdywan about 4 years ago
Revised command(s) taking boo#1149131 into account:
export new_version=15.2; zypper --releasever=$new_version ref
transactional-update shell
. /etc/os-release && sed -i -e "s/${VERSION_ID}/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new_version ref && zypper --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck
for i in $(cat /var/adm/rpmconfigcheck); do vimdiff ${i%.rpm*} $i ; done && rm $(cat /var/adm/rpmconfigcheck)
Note: I uninstalled samba-libs-python-4.9.5 to resolve a conflict.
The upgrade of openqaworker1 went through now.
Updated by livdywan about 4 years ago
imagetester worked with these commands based on what @okurz used (but note the second -e in the sed) - the double ref did not work like it did on openqaworker1:
sed -i -e 's@15\.1@\$releasever@g' -e 's@https:@http:@g' /etc/zypp/repos.d/* && zypper --releasever=15.2 -n ref && transactional-update --no-selfupdate shell
zypper --releasever=15.2 --no-refresh --no-gpg-check -n dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck
for i in $(cat /var/adm/rpmconfigcheck) ; do vimdiff ${i%.rpm*} $i ; done && rm $(cat /var/adm/rpmconfigcheck)
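The loop over /var/adm/rpmconfigcheck relies on the parameter expansion ${i%.rpm*}, which strips the shortest trailing match of .rpm* and so turns a leftover file name back into the config file it belongs to. A minimal sketch of that mapping follows; original_config_for is a hypothetical helper name, and the loop only prints the pairs instead of opening vimdiff or removing anything.

```shell
#!/bin/sh
# Sketch only: original_config_for is a hypothetical helper name.

# Map a leftover file reported by rpmconfigcheck to its config file,
# e.g. /etc/ntp.conf.rpmnew to /etc/ntp.conf.
original_config_for() {
    printf '%s\n' "${1%.rpm*}"
}

# Read the report line by line instead of word-splitting $(cat ...), so that
# paths containing spaces survive. The report path is the one from the ticket.
report=${1:-/var/adm/rpmconfigcheck}
if [ -r "$report" ]; then
    while IFS= read -r leftover; do
        [ -n "$leftover" ] || continue
        printf '%s <- %s\n' "$(original_config_for "$leftover")" "$leftover"
    done < "$report"
fi
```

Printing the pairs first gives a chance to review which files would be compared and deleted before running the interactive loop.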
Updated by livdywan about 4 years ago
- Status changed from In Progress to Feedback
openqaworker4, power8 and openqaworker7 are also looking good now
Updated by okurz about 4 years ago
I fail to access power8, it's stuck in petitboot it seems.
Following comments in https://progress.opensuse.org/issues/55607#note-6 I looked up in the shell of petitboot what mount points are available and found /var/petitboot/mnt/dev/sda3/boot/ with all the necessary kernel images installed.
Based on these files and what I could find in /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg I did
kexec -l /var/petitboot/mnt/dev/sda3/boot/vmlinux-5.3.18-lp152.47-default --initrd=/var/petitboot/mnt/dev/sda3/boot/initrd-5.3.18-lp152.47-default "root=UUID=e0110d2d-84f3-4590-8d12-a472f82cac8b root=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-part3 disk=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020 resume=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-part2 nomodeset console=hvc console=tty"
kexec -e
so now the machine is up but this is not reboot safe.
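The manual boot above depends on a kernel image and its initrd both being present below the mounted boot directory. A small sketch for listing the versions that have both files could look like this. The helper name list_bootable_kernels is hypothetical, and the vmlinux-VERSION / initrd-VERSION naming is taken from the kexec command above.

```shell
#!/bin/sh
# Sketch only: list_bootable_kernels is a hypothetical helper name.

# Print every kernel version below a boot directory that has both a
# vmlinux-VERSION image and a matching initrd-VERSION.
list_bootable_kernels() {
    bootdir=$1
    for kernel in "$bootdir"/vmlinux-*; do
        [ -f "$kernel" ] || continue
        version=${kernel##*/vmlinux-}
        [ -f "$bootdir/initrd-$version" ] && printf '%s\n' "$version"
    done
    return 0
}

# Demo on a scratch directory with made-up version strings.
scratch=$(mktemp -d)
touch "$scratch/vmlinux-1.0-default" "$scratch/initrd-1.0-default" "$scratch/vmlinux-2.0-default"
list_bootable_kernels "$scratch"
rm -rf "$scratch"
```

On the machine from the ticket the directory to pass would be /var/petitboot/mnt/dev/sda3/boot.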
Updated by okurz about 4 years ago
- Related to action #75259: 100% of powerpc tests incomplete auto_review:"(?s)Running on power8.*qemu-system-ppc64: Requested safe cache capability level not supported by kvm":retry added
Updated by okurz about 4 years ago
- Due date set to 2020-11-06
@cdywan please update, what's the current situation? Any further action planned, any problems?
Updated by livdywan about 4 years ago
I'm trying to confirm the state of the bootloader before marking it as Resolved, will report once I know if I have a work-around
Updated by livdywan about 4 years ago
So I tried out the work-around that @nicksinger used in #68053, but alas no dice, it's still getting stuck in petitboot:
power8:~ # nvram --print-config
"common" Partition
---------------------
petitboot,bootdev=uuid:e0110d2d-84f3-4590-8d12-a472f82cac8b
petitboot,bootdevs=uuid:e0110d2d-84f3-4590-8d12-a472f82cac8b
power8:~ # nvram -p common --update-config "petitboot,snapshots?=false"
power8:~ # reboot
ssh root@power8
ssh: connect to host power8 port 22: No route to host
Next step: creating a new boot entry with sda3, /boot/vmlinux and /boot/initrd, and the kernel args from the kexec command above. Boot works, but the entry is gone after reboot... 🙄 Tried a couple of variations but couldn't make my changes stick.
Updated by livdywan about 4 years ago
- Due date changed from 2020-11-06 to 2020-11-13
So I checked /var/log/petitboot/pb-discover.log to confirm that snapshots were disabled, even if that is irrelevant in the context of this problem. The boot device e0110d2d-84f3-4590-8d12-a472f82cac8b is correct, however:
mounting device /dev/sdb1 read-only
Sending discover...
trying parsers for sdb1
SKIP: sdb: no ID_FS_TYPE property
[...]
trying parsers for sda3
parse error: 239('{'): syntax error, unexpected '{', expecting elif or else or fi
SKIP: sdb: no ID_FS_TYPE property
mounting device /dev/sdb1 read-only
trying parsers for sdb1
SKIP: loop0: ignored (path=/devices/virtual/block/loop0)
The loop errors seem semi-interesting in that they only occur during a re-scan after the first scan.
Re-running pb-discover --verbose -l /var/log/petitboot/pb-discover.log by hand after killing the daemon shows trying parser 'grub2' but doesn't reveal anything else.
The file being parsed should be /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg, although this version of petitboot doesn't seem to reveal that. Line 239 would be the hiddenentry block at the very end. So I uncommented that block and hurrah, items show up in petitboot's menu!
A reboot after that actually picked up the default.
The serial console got stuck afterwards, though, and wouldn't let me back in, and SSH was unresponsive, so I'm not positive this is working yet.
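To look at the part of grub.cfg that the parser complains about, the lines around a given line number can be printed with their numbers. The sketch below assumes that the 239 in the parse error is a line number, as interpreted above. The helper name show_lines_around is hypothetical, and the demo uses a generated scratch file rather than the real grub.cfg.

```shell
#!/bin/sh
# Sketch only: show_lines_around is a hypothetical helper name.

# Print the lines of FILE from LINE-CONTEXT to LINE+CONTEXT, prefixed with
# their line numbers. CONTEXT defaults to 5.
show_lines_around() {
    file=$1
    line=$2
    context=${3:-5}
    start=$((line - context))
    [ "$start" -lt 1 ] && start=1
    end=$((line + context))
    awk -v s="$start" -v e="$end" \
        'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }' "$file"
}

# Demo on a generated scratch file.
scratch=$(mktemp)
i=1
while [ "$i" -le 20 ]; do
    echo "line$i"
    i=$((i + 1))
done > "$scratch"
show_lines_around "$scratch" 10 2
rm -f "$scratch"
```

For the case in this ticket the call would be show_lines_around /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg 239.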
Updated by livdywan about 4 years ago
To add insult to injury, the temporary manual boot entry stops at entering emergency mode now, just like booting from the detected entries does.
Updated by livdywan about 4 years ago
cdywan wrote:
The file being parsed should be /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg, although this version of petitboot doesn't seem to reveal that. Line 239 would be the hiddenentry block at the very end. So I uncommented that block and hurrah, items show up in petitboot's menu!
Tried re-enabling the contents of the block (enabling textmode) and finally undoing my fix, and it still doesn't boot. I guess I'll try and pick some more of @nicksinger's brain in the morning.
Updated by livdywan about 4 years ago
- Status changed from Feedback to Resolved
Apparently somewhere down the line the filesystem was not in sync and I was looking at the wrong problem. Boot seems fine now w/o manual intervention.
Updated by okurz about 3 years ago
- Copied to action #99189: Upgrade o3 workers to openSUSE Leap 15.3 size:M added