action #99192

coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3

Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M

Added by okurz 8 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.2 and have a consistent environment

Acceptance criteria

  • AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no left over .rpm-new files, etc.)
  • AC2: openqa-monitor runs openSUSE Leap 15.3

Suggestions

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and IPMI Serial-over-LAN for access, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)

  • For reference, the upgrade to openSUSE Leap 15.1 was described in #55607


Related issues

Related to openQA Infrastructure - action #103683: [tools][sle][x86_64][aarch64][QEMUTPM] install package "swtpm" on x86_64 and aarch64 workers (Resolved, 2021-12-08, 2022-01-14)

Related to QA - action #104025: Grafana: grenache-1: partitions usage (%) alert (Resolved, 2021-12-15)

Related to openQA Infrastructure - action #104016: Broken VirtualBox kernel module on x86_64 OSD workers (Resolved, 2021-12-15)

Related to openQA Project - action #104077: backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M (Resolved, 2021-12-16)

Blocked by openQA Infrastructure - action #104142: osd-deployment pipeline failed: File ... not found on medium (Resolved, 2021-12-17)

Copied from openQA Infrastructure - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved)

History

#1 Updated by okurz 8 months ago

  • Copied from action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added

#2 Updated by okurz 8 months ago

  • Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3
  • Description updated (diff)
  • Assignee deleted (cdywan)
  • Priority changed from High to Normal

#3 Updated by okurz 8 months ago

  • Priority changed from Normal to Low

#4 Updated by mkittler 7 months ago

  • Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M
  • Status changed from New to Workable

#5 Updated by kodymo 6 months ago

  • Status changed from Workable to In Progress
  • Assignee set to kodymo

#6 Updated by mkittler 6 months ago

On o3 we've noticed problems (see #99189#note-15 and subsequent comments), so I suppose it makes sense to add a lock before upgrading:

zypper al qemu-ovmf-x86_64

A lock for qemu-seabios shouldn't be necessary (as of https://github.com/os-autoinst/os-autoinst/pull/1838).

#7 Updated by okurz 6 months ago

  • Status changed from In Progress to Workable

effectively not "in progress", setting back to "Workable".

#8 Updated by okurz 6 months ago

  • Priority changed from Low to High

#9 Updated by okurz 6 months ago

  • Related to action #103683: [tools][sle][x86_64][aarch64][QEMUTPM] install package "swtpm" on x86_64 and aarch64 workers added

#10 Updated by cdywan 5 months ago

  • Status changed from Workable to In Progress
  • Assignee changed from kodymo to cdywan

Since Moritz isn't available atm and we'd really benefit from the upgrades (e.g. swtpm deps) I'm going ahead now. I'll try and upgrade through salt instead of shells on the workers as suggested in the daily.

#11 Updated by cdywan 5 months ago

The following commands can be run via salt i.e. sudo salt -C 'G@roles:worker' cmd.run '...' where ... is the command to execute on the machine:

  • To confirm which workers still need upgrading: grep VERSION=\"15.2 /etc/os-release || true
  • To download packages: zypper --releasever=15.3 ref && zypper --releasever=15.3 -n dup --allow-vendor-change --download-only
  • To perform the upgrade: zypper --no-refresh --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles

To start with I'm only preparing the upgrade, which shouldn't cause much trouble

(Edit: Updated above commands after the fact in case they're copied and re-used in the future, but also mentioned below)
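The three steps above could be wrapped into one small per-host script. This is only a minimal sketch assuming POSIX sh: the function names and the OS_RELEASE variable are hypothetical, while the zypper invocations are the ones listed above.

```shell
#!/bin/sh
# Sketch only: function names and OS_RELEASE are made up for illustration.
OS_RELEASE="${OS_RELEASE:-/etc/os-release}"

# Succeeds if the host still reports Leap 15.2.
needs_upgrade() {
    grep -q 'VERSION="15.2' "$OS_RELEASE"
}

# Refresh repos and fetch packages for 15.3 without installing anything.
download_packages() {
    zypper --releasever=15.3 ref &&
        zypper --releasever=15.3 -n dup --allow-vendor-change --download-only
}

# Run the actual upgrade without refreshing the repositories again.
perform_upgrade() {
    zypper --no-refresh --releasever=15.3 -n dup --allow-vendor-change \
        --auto-agree-with-licenses --replacefiles
}

# Only act on hosts that are still on the old release.
upgrade_if_needed() {
    if needs_upgrade; then
        download_packages && perform_upgrade
    else
        echo "already upgraded, nothing to do"
    fi
}
```

Keeping the download and the upgrade as separate functions mirrors the approach in this comment of preparing first and upgrading later.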

#12 Updated by cdywan 5 months ago

cdywan wrote:

  • To perform the upgrade zypper --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles --download-in-advance

To start with I'm only preparing the upgrade, which shouldn't cause much trouble

Much quicker than I thought. Since only 58 out of 248 are working I decided to go for it.

#13 Updated by openqa_review 5 months ago

  • Due date set to 2021-12-29

Setting due date based on mean cycle time of SUSE QE Tools

#14 Updated by cdywan 5 months ago

  • Related to action #104025: Grafana: grenache-1: partitions usage (%) alert added

#15 Updated by cdywan 5 months ago

  • Related to action #104016: Broken VirtualBox kernel module on x86_64 OSD workers added

#16 Updated by mkittler 5 months ago

It looks like openqaworker3 remained unaffected (is still on Leap 15.2).

Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.

#17 Updated by cdywan 5 months ago

mkittler wrote:

It looks like openqaworker3 remained unaffected (is still on Leap 15.2).

Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.

Ack. I was apparently mistaken that the workers would reboot daily anyway, that's why I thought I could avoid an extra reboot

#18 Updated by cdywan 5 months ago

As suggested by mkittler I'm now using sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' to check if any machines still need a reboot (unlike sudo rebootmgrctl status it doesn't require privilege escalation, and doesn't need me to check the running kernel).

So the next step is checking the remaining workers which show up as "Reboot is suggested".
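With many workers the salt output gets long, so it may help to filter it down to the hosts that still need a reboot. The following is a rough sketch, assuming salt's default text output (a minion ID followed by a colon on its own line, then indented command output) and the "Reboot is suggested" wording mentioned above; hosts_needing_reboot is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: hosts_needing_reboot is a hypothetical helper.
# Reads salt output on stdin and prints each minion whose output
# contains "Reboot is suggested".
hosts_needing_reboot() {
    awk '
        # An unindented line ending in ":" is assumed to be a minion ID.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Print each host at most once.
        /Reboot is suggested/ && host != "" { print host; host = "" }
    '
}
```

It could then be used as sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' | hosts_needing_reboot.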

#19 Updated by okurz 5 months ago

  • Blocked by action #104142: osd-deployment pipeline failed: File ... not found on medium added

#20 Updated by cdywan 5 months ago

There were issues with 3 workers that I was still investigating (because, well, in a lot of output it wasn't immediately clear). I will update this comment from my notes in a minute.

openqaworker-arm-1.suse.de

Rebooted twice since it was still flagged as Reboot is suggested

openqaworker-arm-2.suse.de

Rebooted, and now looking to run new packages

openqaworker3.suse.de

I couldn't identify any issues and solely re-ran dup and rebooted 🤷️

According to salt the machine is still running 15.2 despite my having completed the upgrade and ssh is unresponsive 🧐️

sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || systemctl restart sshd || systemctl status sshd'
Dec 17 13:49:22 openqaworker3 sshd-gen-keys-start[7203]: /usr/sbin/sshd-gen-keys-start: line 7: ssh-keygen: command not found

sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || zypper in -n openssh-common'
No update candidate for 'openssh-common-8.4p1-3.3.1.x86_64'. The highest available version is already installed.

So somehow ssh-keygen is unaccounted for 🤔️

Apparently we had inconsistencies because all repos weren't auto-refreshing: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631

malbec.arch.suse.de
grenache-1.qa.suse.de

Rebooted, and now looking to run new packages

grenache also needed a little kick in the vine:

> systemctl --failed
os-autoinst-openvswitch.service loaded failed failed os-autoinst openvswitch helper
> sudo systemctl restart os-autoinst-openvswitch.service

  • Should document in the wiki to check if zypper reports any orphans after upgrades (#104142) i.e. zypper packages --orphaned
  • Consider re-installing via salt, after dup&other clean-ups i.e. salt state-apply so that we're in a state with all required packages but w/o manually installed or left-over packages that aren't in salt
  • Using -q to avoid getting lost in logs did not work out: -q is much too quiet 🤐️
  • Perform the upgrade with --no-refresh to avoid race conditions after prior package refresh and download
  • I didn't cover any inactive machines here

#21 Updated by okurz 5 months ago

On openqaworker3 a problem was that zypper dup did not want to follow through originally so something was missed and we ended up with non-operative sshd after reboot.

The proper fix is in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631 to make the SUSE_CA repo consistently auto-refresh, same as others.

We started with looking into rpm files and did salt '*' cmd.run 'test -e /etc/salt/minion.rpmnew && mv /etc/salt/minion{,.bak} && mv /etc/salt/minion{.rpmnew,}' && salt '*' state.sls_id /etc/salt/minion salt.minion

#22 Updated by okurz 5 months ago

complete etc file diff from openqaworker3 (except for /etc/salt/minion which we already covered) from salt 'openqaworker3.suse.de' cmd.run 'rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep -v \# ; done' > diff_openqaworker3_after_leap15.3_upgrade.diff attached
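The ${i%.rpm*} expansion in the loop above strips the .rpmnew or .rpmsave suffix to get back to the live config file. A small sketch of the same loop with quoting added might look like this; config_for and show_config_diffs are hypothetical names, and the list file is assumed to contain one leftover path per line as written by rpmconfigcheck.

```shell
#!/bin/sh
# Sketch only: config_for and show_config_diffs are hypothetical helpers.

# Map an rpm leftover such as foo.rpmnew or foo.rpmsave to its live config
# file by stripping the shortest suffix that starts with ".rpm".
config_for() {
    printf '%s\n' "${1%.rpm*}"
}

# Print a unified diff for every leftover listed in the given file,
# dropping lines that contain "#" like the original one-liner does.
show_config_diffs() {
    while IFS= read -r leftover; do
        diff -Naur "$(config_for "$leftover")" "$leftover" | grep -v '#'
    done < "${1:-/var/adm/rpmconfigcheck}"
}
```

Reading the list line by line instead of via $(cat ...) keeps paths with spaces intact, although such paths are unlikely under /etc.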

#24 Updated by cdywan 5 months ago

We seem to have extra packages on openqaworker3:

> sudo salt -C 'G@roles:worker' cmd.run 'zypper packages --orphaned'
openqaworker3.suse.de:
i | @System    | libply-boot-client4     | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply-splash-core4     | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply-splash-graphics4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply4                 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libyui-ncurses11        | 2.54.5-lp152.1.3                    | x86_64
i | @System    | libyui11                | 3.9.3-lp152.1.3                     | x86_64
> sudo salt -C 'G@roles:worker' cmd.run 'zypper rm -y -u $(zypper packages --orphaned | awk "/^i/{ print $5 }" ORS=" ") hello'
> Installation has completed with error

That last message doesn't seem to contradict the fact that the packages are gone 🤓️
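One possible explanation for the error: inside the single-quoted salt command the awk program is in double quotes, so the remote shell most likely expands $5 to an empty string and awk prints whole lines, handing the column separators to zypper rm as well. A hedged sketch of extracting only the package names, assuming the pipe-separated table layout shown above; orphaned_names is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: orphaned_names is a hypothetical helper.
# Reads "zypper packages --orphaned" output on stdin and prints the name
# column (third "|"-separated field) of every installed package row.
orphaned_names() {
    awk -F'|' '/^i/ { gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3 }'
}
```

The resulting names could then be piped into xargs to call zypper rm -y -u; this was not tried as part of this ticket.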

#25 Updated by cdywan 5 months ago

  • Status changed from In Progress to Feedback

All workers seem to be in a consistent state, so I would consider this done and maybe just leave this ticket as a reference w/o trying to generalize it for now (and consider that next time)

#26 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

A related issue that came up recently: https://bugzilla.suse.com/show_bug.cgi?id=1192126

I added more steps and hints to the upgrade section on our wiki with https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&version=138&version_from=137&commit=View+differences . This should suffice then.

I resolved #104142 so we can resolve this one as well now.

#27 Updated by okurz 5 months ago

  • Related to action #104077: backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M added

#28 Updated by cdywan 5 months ago

  • Status changed from Resolved to In Progress

okurz wrote:

  • AC2: openqa-monitor runs openSUSE Leap 15.3

I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the dup call:

( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed

#29 Updated by cdywan 5 months ago

cdywan wrote:

okurz wrote:

  • AC2: openqa-monitor runs openSUSE Leap 15.3

I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the dup call:

( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed

Apparently a re-run went through:

Executing %posttrans script 'kernel-firmware-amdgpu-20210208-2.4.noarch.rpm' ..................................<70%>=================[/]
Output of dmraid-1.0.0.rc16-3.26.x86_64.rpm %posttrans script:
    Updating /etc/sysconfig/dmraid ...

Output of apache2-2.4.43-3.32.1.x86_64.rpm %posttrans script:
    Restarting apache (all instances)

Executing %posttrans scripts .....................................................................................................[done]
Update notifications were received from the following packages:
influxdb-1.7.8-bp153.1.80.x86_64 (/var/adm/update-messages/influxdb-1.7.8-bp153.1.80)
View the notifications now? [y/n] (n): n
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.

I don't know what View the notifications refers to and had no way of opting in here.

Result of rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep "^[ +-][^#;]" ; done > diff_mosd_after_leap15.3_upgrade.diff attached.

reboot
[...]
grep VERSION=\" /etc/os-release; systemctl --failed
VERSION="15.3"
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed
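The check above (expected version plus no failed units) could be scripted for the acceptance criteria. This is only a sketch under the assumption that systemctl --failed ends with a summary line of the form "N loaded units listed" as in the output above; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical.

# Reads "systemctl --failed" output on stdin and prints the number from
# the "N loaded units listed" summary line; fails if no such line exists.
failed_unit_count() {
    awk '/loaded units listed/ { print $1; found = 1 }
         END { if (!found) exit 1 }'
}

# Succeeds if the given os-release file reports 15.3 and the
# "systemctl --failed" output on stdin lists zero units.
upgrade_looks_clean() {
    grep -q 'VERSION="15.3' "${1:-/etc/os-release}" || return 1
    [ "$(failed_unit_count)" = "0" ]
}
```

A possible invocation on a host would be systemctl --failed | upgrade_looks_clean.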

#30 Updated by cdywan 5 months ago

  • Status changed from In Progress to Feedback

Grafana seems to look fine

#31 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

looks good, resolving as discussed together with cdywan

#32 Updated by okurz 5 months ago

  • Due date deleted (2021-12-29)
