action #99192
closedcoordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3
Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M
0%
Description
Motivation¶
- Need to upgrade workers before EOL of Leap 15.2 and have a consistent environment
Acceptance criteria¶
- AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no left over .rpm-new files, etc.)
- AC2: openqa-monitor runs openSUSE Leap 15.3
Suggestions¶
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with snapper rollback
Further details¶
Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover, the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
For reference, the upgrade to openSUSE Leap 15.1 was described in #55607
Files
Updated by okurz about 3 years ago
- Copied from action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added
Updated by okurz about 3 years ago
- Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3
- Description updated (diff)
- Assignee deleted (livdywan)
- Priority changed from High to Normal
Updated by mkittler almost 3 years ago
- Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M
- Status changed from New to Workable
Updated by kodymo almost 3 years ago
- Status changed from Workable to In Progress
- Assignee set to kodymo
Updated by mkittler almost 3 years ago
On o3 we've noticed problems (see #99189#note-15 and subsequent comments), so I suppose it makes sense to add a lock before upgrading:
zypper al qemu-ovmf-x86_64
A lock for qemu-seabios shouldn't be necessary (as of https://github.com/os-autoinst/os-autoinst/pull/1838).
Updated by okurz almost 3 years ago
- Status changed from In Progress to Workable
effectively not "in progress", setting back to "Workable".
Updated by okurz almost 3 years ago
- Related to action #103683: [tools][sle][x86_64][aarch64][QEMUTPM] install package "swtpm" on x86_64 and aarch64 workers added
Updated by livdywan almost 3 years ago
- Status changed from Workable to In Progress
- Assignee changed from kodymo to livdywan
Since Moritz isn't available atm and we'd really benefit from the upgrades (e.g. swtpm deps) I'm going ahead now. I'll try and upgrade through salt instead of shells on the workers as suggested in the daily.
Updated by livdywan almost 3 years ago
The following commands can be run via salt i.e. sudo salt -C 'G@roles:worker' cmd.run '...'
where ... is the command to execute on the machine:
- To confirm which workers still need upgrading:
grep VERSION=\"15.2 /etc/os-release || true
- To download packages:
zypper --releasever=15.3 ref && zypper --releasever=15.3 -n dup --allow-vendor-change --download-only
- To perform the upgrade
zypper --no-refresh --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles
To start with I'm only preparing the upgrade, which shouldn't cause much trouble
(Edit: Updated above commands after the fact in case they're copied and re-used in the future, but also mentioned below)
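A minimal sketch of the "which workers still need upgrading" check as a reusable function, assuming os-release carries a line of the form VERSION="15.2" as the grep above implies. The function names os_release_version and needs_upgrade are hypothetical and not part of any existing tooling:

```shell
#!/bin/sh
# Hypothetical helper: print the VERSION value of an os-release style file.
# The path is a parameter so the function can be tried on a copy of the file.
os_release_version() {
    # Extract the text between the quotes of the VERSION="..." line.
    sed -n 's/^VERSION="\(.*\)"$/\1/p' "${1:-/etc/os-release}"
}

# Hypothetical helper: succeed (exit 0) when the file reports a version
# other than the target release, i.e. the host still needs the upgrade.
needs_upgrade() {
    target="$1"
    file="${2:-/etc/os-release}"
    [ "$(os_release_version "$file")" != "$target" ]
}
```

On a worker this could be used as needs_upgrade 15.3 && echo "still to be upgraded".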
Updated by livdywan almost 3 years ago
cdywan wrote:
- To perform the upgrade
zypper --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles --download-in-advance
To start with I'm only preparing the upgrade, which shouldn't cause much trouble
Much quicker than I thought. Since only 58 out of 248 are working I decided to go for it.
Updated by openqa_review almost 3 years ago
- Due date set to 2021-12-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 3 years ago
- Related to action #104025: Grafana: grenache-1: partitions usage (%) alert added
Updated by livdywan almost 3 years ago
- Related to action #104016: Broken VirtualBox kernel module on x86_64 OSD workers added
Updated by mkittler almost 3 years ago
It looks like openqaworker3 remained unaffected (is still on Leap 15.2).
Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.
Updated by livdywan almost 3 years ago
mkittler wrote:
It looks like openqaworker3 remained unaffected (is still on Leap 15.2).
Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.
Ack. I was apparently mistaken that the workers would reboot daily anyway, that's why I thought I could avoid an extra reboot
Updated by livdywan almost 3 years ago
As suggested by @mkittler I'm now using sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' to check if any machines still need a reboot (unlike sudo rebootmgrctl status it doesn't require privilege escalation, and doesn't need me to check the running kernel).
So the next step is checking the remaining workers which show up as "Reboot is suggested".
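To narrow the salt output down to the hosts that still need a reboot, something like the following could help. This is only a sketch: it assumes salt's default text output, where each minion id is printed unindented with a trailing colon and the command output is indented below it. The function name hosts_needing_reboot is made up for illustration:

```shell
#!/bin/sh
# Hypothetical filter: read salt cmd.run output on stdin and print the
# minion ids whose output contains "Reboot is suggested".
# Assumption: minion ids are unindented lines ending in ":".
hosts_needing_reboot() {
    awk '
        # An unindented line ending in ":" starts a new minion block.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Report the current minion when the hint shows up in its block.
        /Reboot is suggested/ { print host }
    '
}
```

It would be fed from the salt call above, e.g. sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' | hosts_needing_reboot.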
Updated by okurz almost 3 years ago
- Blocked by action #104142: osd-deployment pipeline failed: File ... not found on medium added
Updated by livdywan almost 3 years ago
There were issues with 3 workers that I was still investigating (because, well, in a lot of output it wasn't immediately clear). Will update this comment from my notes in a minute.
openqaworker-arm-1.suse.de
Rebooted twice since it was still flagged as Reboot is suggested
openqaworker-arm-2.suse.de
Rebooted, and now looking to run new packages
openqaworker3.suse.de
I couldn't identify any issues and solely re-ran dup and rebooted 🤷️
According to salt the machine is still running 15.2 despite my having completed the upgrade and ssh is unresponsive 🧐️
sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || systemctl restart sshd || systemctl status sshd'
Dec 17 13:49:22 openqaworker3 sshd-gen-keys-start[7203]: /usr/sbin/sshd-gen-keys-start: line 7: ssh-keygen: command not found
sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || zypper in -n openssh-common'
No update candidate for 'openssh-common-8.4p1-3.3.1.x86_64'. The highest available version is already installed.
So somehow ssh-keygen is unaccounted for 🤔️
Apparently we had inconsistencies because not all repos were auto-refreshing: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631
malbec.arch.suse.de
grenache-1.qa.suse.de
Rebooted, and now looking to run new packages
grenache also needed a little kick in the vine:
> systemctl --failed
os-autoinst-openvswitch.service loaded failed failed os-autoinst openvswitch helper
> sudo systemctl restart os-autoinst-openvswitch.service
- Should document in the wiki to check if zypper reports any orphans after upgrades (#104142) i.e. zypper packages --orphaned
- Consider re-installing via salt, after dup & other clean-ups i.e. salt state-apply so that we're in a state with all required packages but w/o manually installed or left-over packages that aren't in salt
- Use -q to avoid getting lost in logs. -q is much too quiet 🤐️
- Perform the upgrade with --no-refresh to avoid race conditions after prior package refresh and download
- I didn't cover any inactive machines here
Updated by okurz almost 3 years ago
On openqaworker3 a problem was that zypper dup did not want to follow through originally so something was missed and we ended up with non-operative sshd after reboot.
The proper fix is in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631 to make the SUSE_CA repo consistently auto-refresh, same as others.
We started with looking into rpm files and did salt '*' cmd.run 'test -e /etc/salt/minion.rpmnew && mv /etc/salt/minion{,.bak} && mv /etc/salt/minion{.rpmnew,}' && salt '*' state.sls_id /etc/salt/minion salt.minion
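The rpmnew handling from the command above can be sketched as a small function without brace expansion, so it also works in a plain POSIX sh. The name promote_rpmnew is hypothetical; the behaviour mirrors the one-liner (back up the active file, then move the .rpmnew file into place), except that a missing .rpmnew file is treated as "nothing to do" rather than as a failure:

```shell
#!/bin/sh
# Hypothetical helper: replace a config file with its .rpmnew counterpart,
# keeping the previous version as <file>.bak.
promote_rpmnew() {
    conf="$1"
    # Nothing to do when the package did not leave a .rpmnew file behind.
    [ -e "$conf.rpmnew" ] || return 0
    # Keep a backup first; only move the new file in if the backup worked.
    mv "$conf" "$conf.bak" && mv "$conf.rpmnew" "$conf"
}
```

For the case above the call would be promote_rpmnew /etc/salt/minion, followed by re-applying the salt state as shown.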
Updated by okurz almost 3 years ago
- File diff_openqaworker3_after_leap15.3_upgrade.diff added
complete etc file diff from openqaworker3 (except for /etc/salt/minion which we already covered) from salt 'openqaworker3.suse.de' cmd.run 'rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep -v \# ; done' > diff_openqaworker3_after_leap15.3_upgrade.diff
attached
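For readability, the diff loop from the one-liner above could be written as a function. This is a sketch under the assumption that the list file contains one leftover path per line (such as *.rpmnew or *.rpmsave), which is how /var/adm/rpmconfigcheck is used above. The function name diff_rpm_leftovers is hypothetical, and the comment filtering via grep is left out:

```shell
#!/bin/sh
# Hypothetical helper: for every leftover file named in the list, print a
# unified diff between the active config file and the leftover.
diff_rpm_leftovers() {
    list="${1:-/var/adm/rpmconfigcheck}"
    while IFS= read -r leftover; do
        # Skip empty lines in the list.
        [ -n "$leftover" ] || continue
        # Strip the trailing .rpm* suffix to get the active file name.
        active="${leftover%.rpm*}"
        # diff exits with 1 when the files differ, which is expected here.
        diff -Naur "$active" "$leftover" || true
    done < "$list"
}
```

Quoting the paths and reading line by line avoids word splitting on unusual file names, which the $(cat ...) form does not protect against.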
Updated by okurz almost 3 years ago
Fix for kvm udev rules deployment: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/632
see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/750090#L8454
Updated by livdywan almost 3 years ago
We seem to have extra packages on openqaworker3:
> sudo salt -C 'G@roles:worker' cmd.run 'zypper packages --orphaned'
openqaworker3.suse.de:
i | @System | libply-boot-client4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System | libply-splash-core4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System | libply-splash-graphics4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System | libply4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System | libyui-ncurses11 | 2.54.5-lp152.1.3 | x86_64
i | @System | libyui11 | 3.9.3-lp152.1.3 | x86_64
> sudo salt -C 'G@roles:worker' cmd.run 'zypper rm -y -u $(zypper packages --orphaned | awk "/^i/{ print $5 }" ORS=" ") hello'
> Installation has completed with error
That last message doesn't seem to contradict the fact that the packages are gone 🤓️
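A possible explanation for the error, stated with caution: in the command above the awk program sits inside double quotes, so the shell on the worker likely expands $5 to an empty string before awk runs, and awk then prints whole table rows instead of only the package name. A sketch of a more robust extraction, assuming the "|"-separated table layout shown above with the name in the third column (the function name orphaned_names is made up):

```shell
#!/bin/sh
# Hypothetical filter: read "zypper packages --orphaned" style rows on stdin
# and print only the package names of installed packages (rows starting "i").
orphaned_names() {
    awk -F'|' '/^i/ {
        # Trim the padding around the name column before printing it.
        gsub(/^[ \t]+|[ \t]+$/, "", $3)
        print $3
    }'
}
```

Run directly on a worker this could look like zypper packages --orphaned | orphaned_names. Passing it through salt cmd.run needs extra care with the nested single quotes.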
Updated by livdywan almost 3 years ago
- Status changed from In Progress to Feedback
All workers seem to be in a consistent state, so I would consider this done and maybe just leave this ticket as a reference w/o trying to generalize it for now (and consider that next time)
Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
A related issue that came up recently: https://bugzilla.suse.com/show_bug.cgi?id=1192126
I added more steps and hints to the upgrade section on our wiki with https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&version=138&version_from=137&commit=View+differences . This should suffice then.
I resolved #104142 so we can resolve this one as well now.
Updated by okurz almost 3 years ago
- Related to action #104077: backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M added
Updated by livdywan almost 3 years ago
- Status changed from Resolved to In Progress
okurz wrote:
- AC2: openqa-monitor runs openSUSE Leap 15.3
I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the output of the dup call:
( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed
Updated by livdywan almost 3 years ago
cdywan wrote:
okurz wrote:
- AC2: openqa-monitor runs openSUSE Leap 15.3
I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the dup call:
( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed
Apparently a re-run went through:
Executing %posttrans script 'kernel-firmware-amdgpu-20210208-2.4.noarch.rpm' ..................................<70%>=================[/]
Output of dmraid-1.0.0.rc16-3.26.x86_64.rpm %posttrans script:
Updating /etc/sysconfig/dmraid ...
Output of apache2-2.4.43-3.32.1.x86_64.rpm %posttrans script:
Restarting apache (all instances)
Executing %posttrans scripts .....................................................................................................[done]
Update notifications were received from the following packages:
influxdb-1.7.8-bp153.1.80.x86_64 (/var/adm/update-messages/influxdb-1.7.8-bp153.1.80)
View the notifications now? [y/n] (n): n
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
I don't know what "View the notifications" refers to and had no way of opting in here.
Result of rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep "^[ +-][^#;]" ; done > diff_mosd_after_leap15.3_upgrade.diff
attached.
reboot
[...]
grep VERSION=\" /etc/os-release; systemctl --failed
VERSION="15.3"
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed
Updated by livdywan almost 3 years ago
- Status changed from In Progress to Feedback
Grafana seems to look fine
Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
looks good, resolving as discussed together with cdywan
Updated by okurz over 2 years ago
- Copied to action #111866: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4 added