action #73189

coordination #69478: [epic] Upgrade o3+osd workers+webui to openSUSE Leap 15.2

Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done

Added by okurz 10 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
% Done: 0%

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.1 and have a consistent environment

Acceptance criteria

  • AC1: all o3 worker machines run a cleanly upgraded openSUSE Leap 15.2 (no failed systemd services, no leftover .rpmnew files, etc.)
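
AC1 can be spot-checked with a couple of commands. The following is a minimal sketch and not part of the ticket: the function names are hypothetical, and it assumes the leftovers in question are the .rpmnew, .rpmsave and .rpmorig files that rpm writes next to modified config files.

```shell
# Hypothetical helpers for checking AC1 on a worker; names are illustrative.

# Print leftover RPM config files below a directory (default: /etc).
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Print the names of failed systemd units, one per line (empty if none).
list_failed_units() {
    systemctl --failed --no-legend --plain | awk '{print $1}'
}

# Usage on a worker (as root):
#   list_failed_units
#   find_rpm_leftovers /etc
```

Both functions print nothing on a clean machine, so empty output from both is the state AC1 asks for.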

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the workers are only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Use the instructions from above but use transactional-update shell for transactional update workers
  • After upgrade reboot and check everything working as expected, if not rollback, e.g. with transactional-update rollback

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN; a reinstall is possible and not hard; there is no important data on the host (it's only an openQA worker); and there are other machines that can run jobs while one host is down for a little longer. And okurz can hold your hand :)

Related issues

Related to openQA Infrastructure - action #75259: 100% of powerpc tests incomplete auto_review:"(?s)Running on power8.*qemu-system-ppc64: Requested safe cache capability level not supported by kvm":retry (Resolved, 2020-10-25)

Copied from openQA Infrastructure - action #72079: Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed" (Resolved, 2020-09-29)

History

#1 Updated by okurz 10 months ago

  • Copied from action #72079: Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed" added

#2 Updated by cdywan 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

#3 Updated by okurz 10 months ago

be aware of the problems encountered in #72079 though. So upgrading other machines can help to crosscheck if problems reproduce there but this can also increase the impact for more users.

#4 Updated by cdywan 10 months ago

okurz wrote:

be aware of the problems encountered in #72079 though. So upgrading other machines can help to crosscheck if problems reproduce there but this can also increase the impact for more users.

Aye, I went through the docs and the steps of the upgrade but didn't activate the snapshot for now.

Two error messages stood out, and I was wondering if you saw this with openqa-aarch64:

  • /usr/sbin/grub2-install: warning: Couldn't find physical volume (null)
  • dracut module 'biosdevname' will not be installed, because command 'biosdevname' could not be found

I couldn't find anything in the grub configuration. Of course the missing module could simply be installed - I'm mostly wondering if this is pointing to an existing misconfiguration.

#5 Updated by okurz 10 months ago

Haven't seen these messages but I wouldn't worry. You can just make sure you have access over IPMI for a quick recovery if needed and try the reboot.

#6 Updated by okurz 9 months ago

cdywan I assume you did not yet upgrade any further workers. Do you still plan to?

#7 Updated by cdywan 9 months ago

  • Status changed from In Progress to Workable

okurz wrote:

cdywan I assume you did not yet upgrade any further workers. Do you still plan to?

I was going to mark it as blocked by #72079, but Redmine distracted me with a confusing error and tried to make it block the other ticket instead.

#8 Updated by okurz 9 months ago

All the problems in #72079 I consider likely to be aarch64 specifics so you can go ahead with the upgrade.

#9 Updated by okurz 9 months ago

Little hint: To try out upgrading in a safe environment you can run an openSUSE Leap 15.1 container, e.g. podman run --rm -it registry.opensuse.org/opensuse/leap:15.1 and run the upgrade with https://github.com/okurz/auto-upgrade/blob/master/auto-upgrade

#10 Updated by cdywan 9 months ago

  • Status changed from Workable to In Progress

okurz wrote:

Little hint: To try out upgrading in a safe environment you can run an openSUSE Leap 15.1 container, e.g. podman run --rm -it registry.opensuse.org/opensuse/leap:15.1 and run the upgrade with https://github.com/okurz/auto-upgrade/blob/master/auto-upgrade

I tried out the script. I like the idea. Unfortunately the use of screen consistently fails here with two different terminals. Will open a GitHub issue.

For now I'm proceeding with this one-liner (inside transactional-update --no-selfupdate shell):

old=15.1; new=15.2; zypper -n up --auto-agree-with-licenses --replacefiles && sed -i -e "s/$old/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new -n ref && zypper --releasever=$new -n dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck

On imagetester I ran into this:

Download (curl) error for 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.1/noarch/os-autoinst-distri-opensuse-deps-1.1603448722.02a909610-lp151.6188.1.noarch.rpm':

Error code: Curl error 60
Error message: SSL certificate problem: self signed certificate in certificate chain

So I tried openqaworker1 and also ran into (certificate) errors:

Repository 'Providing openQA dependencies (openSUSE_Leap_15.2)' is invalid.

[devel_openQA|http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.2/] Valid metadata not found at specified URL

I thought maybe it's the slightly different steps so I used the ones from the wiki which I knew to have worked before (inside transactional-update --no-selfupdate shell):

new_version=15.2; . /etc/os-release && sed -i -e "s/${VERSION_ID}/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new_version ref && zypper -n --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck

I guess I need to find out now what's upsetting zypper, and even affecting transactional-update's ability to check for new versions.
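
The repository rewrite at the start of both one-liners can be tried on its own against a copy of the repo files. The following is a minimal sketch, assuming GNU sed (for -i); the function name repos_to_releasever is hypothetical. Unlike the one-liners above, it escapes the dots in the version so that 15.1 only matches a literal "15.1".

```shell
# Replace the old release number in all *.repo files below a directory with
# the literal string $releasever. Dots in the version are escaped so that
# "15.1" does not also match e.g. "15x1".
# Hypothetical helper; the default directory matches the ticket's commands.
repos_to_releasever() {
    old_escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    for repo in "${2:-/etc/zypp/repos.d}"/*.repo; do
        [ -e "$repo" ] || continue
        sed -i -e "s/${old_escaped}/\$releasever/g" "$repo"
    done
}

# Usage (as root), followed by the refresh from the ticket:
#   . /etc/os-release && repos_to_releasever "$VERSION_ID"
#   zypper --releasever=15.2 ref
```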

#11 Updated by okurz 9 months ago

You need to run zypper ref outside the transactional-update shell.

I did with host "rebel" now:

sed -i -e 's@15\.1@\$releasever@g' -e 's@https:@http:@g' /etc/zypp/repos.d/*
zypper --releasever=15.2 -n ref
transactional-update --no-selfupdate shell
zypper --releasever=15.2 --no-refresh --no-gpg-check -n dup --auto-agree-with-licenses --replacefiles --download-in-advance
rpmconfigcheck
for i in $(cat /var/adm/rpmconfigcheck) ; do vimdiff ${i%.rpm*} $i ; done
rm $(cat /var/adm/rpmconfigcheck)
exit
reboot

What previously failed was either the repository refresh or the package download, due to certificate mismatches.
It then worked after changing all https repos to http and calling zypper ref outside the transactional shell.

EDIT: 2020-10-23 16:51 Fixed sed expression with second '-e'
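
The review loop over /var/adm/rpmconfigcheck relies on word splitting, which is fine for typical config paths. A slightly more defensive variant is sketched below; rpmconfig_pairs is a hypothetical name, and the sketch only prints the file pairs rather than opening an editor.

```shell
# Print "<active config><TAB><rpm leftover>" pairs for every path listed in
# an rpmconfigcheck report, reading line by line so paths with spaces survive.
# Hypothetical helper; the default path is the one used in the ticket.
rpmconfig_pairs() {
    while IFS= read -r leftover; do
        [ -n "$leftover" ] || continue
        printf '%s\t%s\n' "${leftover%.rpm*}" "$leftover"
    done < "${1:-/var/adm/rpmconfigcheck}"
}
```

Each printed pair corresponds to one vimdiff invocation in the loop above: the first column is the file currently in use, the second the version shipped by the package.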

#12 Updated by cdywan 9 months ago

Revised command(s) taking boo#1149131 into account:

export new_version=15.2; zypper --releasever=$new_version ref
transactional-update shell
. /etc/os-release && sed -i -e "s/${VERSION_ID}/\$releasever/g" /etc/zypp/repos.d/* && zypper --releasever=$new_version ref && zypper --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck

for i in $(cat /var/adm/rpmconfigcheck); do vimdiff ${i%.rpm*} $i ; done && rm $(cat /var/adm/rpmconfigcheck)

Note: I uninstalled samba-libs-python-4.9.5 to resolve a conflict.

The upgrade of openqaworker1 went through now.

#13 Updated by cdywan 9 months ago

imagetester worked with these commands based on what okurz used (but note the second -e in the sed) - the double ref did not work like it did on openqaworker1:

sed -i -e 's@15\.1@\$releasever@g' -e 's@https:@http:@g' /etc/zypp/repos.d/* && zypper --releasever=15.2 -n ref && transactional-update --no-selfupdate shell
zypper --releasever=15.2 --no-refresh --no-gpg-check -n dup --auto-agree-with-licenses --replacefiles --download-in-advance && rpmconfigcheck
for i in $(cat /var/adm/rpmconfigcheck) ; do vimdiff ${i%.rpm*} $i ; done && rm $(cat /var/adm/rpmconfigcheck)

#14 Updated by cdywan 9 months ago

  • Status changed from In Progress to Feedback

openqaworker4, power8 and openqaworker7 are also looking good now

#15 Updated by okurz 9 months ago

I fail to access power8, it's stuck in petitboot it seems.
Following comments in https://progress.opensuse.org/issues/55607#note-6 I looked up in the shell of petitboot what mount points are available and found /var/petitboot/mnt/dev/sda3/boot/ with all the necessary kernel images installed.

Based on these files and what I could find in /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg I did

kexec -l /var/petitboot/mnt/dev/sda3/boot/vmlinux-5.3.18-lp152.47-default --initrd=/var/petitboot/mnt/dev/sda3/boot/initrd-5.3.18-lp152.47-default "root=UUID=e0110d2d-84f3-4590-8d12-a472f82cac8b  root=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-part3 disk=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020 resume=/dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-part2 nomodeset console=hvc console=tty"
kexec -e

so now the machine is up but this is not reboot safe.
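
To reduce typing errors on the petitboot shell, the kexec invocation can be assembled from the boot directory and kernel version. This is a minimal sketch under assumptions: kexec_commands is a hypothetical helper, it mirrors the argument form used above (kernel command line as a trailing positional argument), and it only prints the commands instead of running them.

```shell
# Print the kexec commands for a given kernel version, after checking that
# both the kernel and its initrd exist below the mounted boot directory.
# Hypothetical helper: nothing is executed, the commands are only printed.
kexec_commands() {
    boot_dir=$1
    kver=$2
    cmdline=$3
    kernel="$boot_dir/vmlinux-$kver"
    initrd="$boot_dir/initrd-$kver"
    [ -f "$kernel" ] || { echo "missing $kernel" >&2; return 1; }
    [ -f "$initrd" ] || { echo "missing $initrd" >&2; return 1; }
    printf 'kexec -l %s --initrd=%s "%s"\n' "$kernel" "$initrd" "$cmdline"
    printf 'kexec -e\n'
}

# Usage, with the kernel arguments taken from grub.cfg:
#   kexec_commands /var/petitboot/mnt/dev/sda3/boot 5.3.18-lp152.47-default "root=UUID=... console=hvc"
```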

#16 Updated by okurz 9 months ago

  • Related to action #75259: 100% of powerpc tests incomplete auto_review:"(?s)Running on power8.*qemu-system-ppc64: Requested safe cache capability level not supported by kvm":retry added

#17 Updated by okurz 9 months ago

  • Due date set to 2020-11-06

cdywan please update, what's the current situation? Any further action planned, any problems?

#18 Updated by cdywan 9 months ago

I'm trying to confirm the state of the bootloader before marking it as Resolved, will report once I know if I have a work-around

#19 Updated by cdywan 9 months ago

So I tried out the work-around that nicksinger used in #68053; alas, no dice, it's still getting stuck in petitboot:

power8:~ # nvram --print-config
"common" Partition
---------------------
petitboot,bootdev=uuid:e0110d2d-84f3-4590-8d12-a472f82cac8b
petitboot,bootdevs=uuid:e0110d2d-84f3-4590-8d12-a472f82cac8b

power8:~ # nvram -p common --update-config "petitboot,snapshots?=false"
power8:~ # reboot

ssh root@power8
ssh: connect to host power8 port 22: No route to host

Next step: creating a new boot entry with sda3, /boot/vmlinux and /boot/initrd, and the kernel args from the kexec command above. Boot works, but the entry is gone after reboot... 🙄 Tried a couple of variations but couldn't make my changes stick.
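
Whether an nvram setting actually stuck can be checked by reading the value back after the update. A minimal sketch, assuming the nvram --print-config output keeps the key=value line format shown above; nvram_value is a hypothetical helper name.

```shell
# Read `nvram --print-config` output on stdin and print the value of one key,
# e.g. "petitboot,bootdev". Prints nothing if the key is not set.
# Hypothetical helper for inspecting the settings shown above.
nvram_value() {
    awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2) }
    '
}

# Usage on the worker:
#   nvram --print-config | nvram_value 'petitboot,snapshots?'
```

The match is on the full key followed by "=", so asking for petitboot,bootdev does not also return the petitboot,bootdevs line.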

#20 Updated by cdywan 9 months ago

  • Due date changed from 2020-11-06 to 2020-11-13

So I checked /var/log/petitboot/pb-discover.log to confirm that snapshots were disabled (even if that is irrelevant in the context of this problem). The boot device e0110d2d-84f3-4590-8d12-a472f82cac8b is correct, however:

mounting device /dev/sdb1 read-only
Sending discover...
trying parsers for sdb1
SKIP: sdb: no ID_FS_TYPE property
[...]
trying parsers for sda3
parse error: 239('{'): syntax error, unexpected '{', expecting elif or else or fi
SKIP: sdb: no ID_FS_TYPE property
mounting device /dev/sdb1 read-only
trying parsers for sdb1
SKIP: loop0: ignored (path=/devices/virtual/block/loop0)

The loop errors seem semi-interesting in that they only occur during a re-scan after the first scan.

Re-running pb-discover --verbose -l /var/log/petitboot/pb-discover.log by hand after killing the daemon shows "trying parser 'grub2'" but doesn't reveal anything else.

The file being parsed should be /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg, although this version of petitboot doesn't seem to reveal that. Line 239 would be the hiddenentry block at the very end. So I commented out that block and hurrah, items show up in petitboot's menu!

A reboot after that actually picked up the default.

The serial console got stuck afterwards, though, and wouldn't let me back in, and SSH was unresponsive, so I'm not positive this is working yet.
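
Going from the log message to the offending part of grub.cfg can be done mechanically. The following is a minimal sketch, assuming the "parse error: N(...)" message format seen in the log excerpt above; both function names are hypothetical.

```shell
# Print the line numbers reported in "parse error: N(...)" log messages.
parse_error_lines() {
    sed -n 's/^parse error: \([0-9][0-9]*\)(.*/\1/p' "$1" | sort -un
}

# Print a few lines of context around line N of a file (hypothetical helper):
# up to five lines before N and two lines after it.
show_context() {
    start=$(( $2 > 5 ? $2 - 5 : 1 ))
    sed -n "${start},$(( $2 + 2 ))p" "$1"
}

# Usage on the petitboot shell:
#   parse_error_lines /var/log/petitboot/pb-discover.log
#   show_context /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg 239
```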

#21 Updated by cdywan 9 months ago

To add insult to injury, the temporary manual boot entry stops at entering emergency mode now, just like booting from the detected entries does.

#22 Updated by cdywan 9 months ago

cdywan wrote:

The file being parsed should be /var/petitboot/mnt/dev/sda3/boot/grub2/grub.cfg, although this version of petitboot doesn't seem to reveal that. Line 239 would be the hiddenentry block at the very end. So I commented out that block and hurrah, items show up in petitboot's menu!

Tried re-enabling the contents of the block (enabling textmode) and finally undoing my fix, and it still doesn't boot. I guess I'll try and pick some more of nicksinger's brain in the morning.

#23 Updated by cdywan 9 months ago

  • Status changed from Feedback to Resolved

Apparently somewhere down the line the filesystem was not in sync and I was looking at the wrong problem. Boot seems fine now w/o manual intervention.

#24 Updated by okurz 8 months ago

  • Due date deleted (2020-11-13)
