openQA Project (public) - Ready

Description

Motivation¶

We need to upgrade the workers before Leap 15.2 reaches EOL and to keep a consistent environment

Acceptance criteria¶

AC1: all osd worker machines run a cleanly upgraded openSUSE Leap 15.3 (no failed systemd services, no leftover .rpmnew files, etc.)
AC2: openqa-monitor runs openSUSE Leap 15.3

Suggestions¶

Read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
Reserve some time when the workers are executing only a few or no openQA test jobs
Keep the IPMI interface ready and test that Serial-over-LAN works for potential recovery
After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with snapper rollback

Further details¶

Don't worry, everything can be repaired :) If by any chance a worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN; a reinstall is possible and not hard; there is no important data on the host (it's only an openQA worker); and there are other machines that can run jobs while one host is down for a little longer. And okurz can hold your hand :)
For reference, the upgrade to openSUSE Leap 15.1 was described in #55607

Files

diff_openqaworker3_after_leap15.3_upgrade.diff (6.63 KB), okurz, 2021-12-17 14:32
diff_mosd_after_leap15.3_upgrade.diff (7.23 KB), livdywan, 2021-12-21 12:07

Related issues 7 (0 open — 7 closed)

Related to openQA Infrastructure (public) - action #103683: [tools][sle][x86_64][aarch64][QEMUTPM] install package "swtpm" on x86_64 and aarch64 workers (Resolved; 2021-12-08; 2022-01-14)

Related to QA (public) - action #104025: Grafana: grenache-1: partitions usage (%) alert (Resolved; 2021-12-15)

Related to openQA Infrastructure (public) - action #104016: Broken VirtualBox kernel module on x86_64 OSD workers (Resolved; mkittler; 2021-12-15)

Related to openQA Project (public) - action #104077: backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M (Resolved; okurz; 2021-12-16)

Blocked by openQA Infrastructure (public) - action #104142: osd-deployment pipeline failed: File ... not found on medium (Resolved; okurz; 2021-12-17)

Copied from openQA Infrastructure (public) - action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 (Resolved)

Copied to openQA Infrastructure (public) - action #111866: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4 (Resolved; okurz)

Updated by okurz over 3 years ago

Copied from action #75238: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 added

Updated by okurz over 3 years ago

Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.2 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3
Description updated (diff)
Assignee deleted (~~livdywan~~)
Priority changed from High to Normal

Updated by okurz over 3 years ago

Priority changed from Normal to Low

Updated by mkittler over 3 years ago

Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M
Status changed from New to Workable

Updated by kodymo over 3 years ago

Status changed from Workable to In Progress
Assignee set to kodymo

Updated by mkittler over 3 years ago

On o3 we've noticed problems (see #99189#note-15 and subsequent comments), so I suppose it makes sense to add a lock before upgrading:

zypper al qemu-ovmf-x86_64

A lock for qemu-seabios shouldn't be necessary (as of https://github.com/os-autoinst/os-autoinst/pull/1838).

Updated by okurz over 3 years ago

Status changed from In Progress to Workable

Effectively not "in progress", so setting back to "Workable".

Updated by okurz over 3 years ago

Priority changed from Low to High

Updated by okurz over 3 years ago

Related to action #103683: [tools][sle][x86_64][aarch64][QEMUTPM] install package "swtpm" on x86_64 and aarch64 workers added

#10

Updated by livdywan over 3 years ago

Status changed from Workable to In Progress
Assignee changed from kodymo to livdywan

Since Moritz isn't available atm and we'd really benefit from the upgrades (e.g. swtpm deps) I'm going ahead now. I'll try and upgrade through salt instead of shells on the workers as suggested in the daily.

#11

Updated by livdywan over 3 years ago

The following commands can be run via salt, i.e. sudo salt -C 'G@roles:worker' cmd.run '...', where ... is the command to execute on the machine:

To confirm which workers still need upgrading: grep VERSION=\"15.2 /etc/os-release || true
To download packages: zypper --releasever=15.3 ref && zypper --releasever=15.3 -n dup --allow-vendor-change --download-only
To perform the upgrade: zypper --no-refresh --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles

To start with I'm only preparing the upgrade, which shouldn't cause much trouble

(Edit: Updated above commands after the fact in case they're copied and re-used in the future, but also mentioned below)
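
For re-use, the three steps above could be wrapped in small shell helpers. This is only a sketch: the function names, the TARGET_RELEASE variable and the optional os-release path argument are made up here for illustration, while the zypper invocations are the ones listed above.

```shell
# Sketch only: helper names and TARGET_RELEASE are illustrative, not part of any existing tooling.
TARGET_RELEASE="15.3"

# Print the VERSION value from an os-release style file (defaults to /etc/os-release).
os_release_version() {
    sed -n 's/^VERSION="\([^"]*\)".*/\1/p' "${1:-/etc/os-release}"
}

# Succeed if the host is not yet on the target release.
needs_upgrade() {
    [ "$(os_release_version "$1")" != "$TARGET_RELEASE" ]
}

# Step 1: refresh repositories and download packages without installing anything.
prepare_upgrade() {
    zypper --releasever="$TARGET_RELEASE" ref &&
        zypper --releasever="$TARGET_RELEASE" -n dup --allow-vendor-change --download-only
}

# Step 2: install from the packages downloaded in step 1, without refreshing again.
perform_upgrade() {
    zypper --no-refresh --releasever="$TARGET_RELEASE" -n dup \
        --allow-vendor-change --auto-agree-with-licenses --replacefiles
}
```

Splitting download and installation keeps the risky part short, and needs_upgrade makes it easy to skip hosts that are already on the target release. The helpers would have to be available on the workers before they could be called through salt.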

#12

Updated by livdywan over 3 years ago

cdywan wrote:

To perform the upgrade zypper --releasever=15.3 -n dup --allow-vendor-change --auto-agree-with-licenses --replacefiles --download-in-advance

To start with I'm only preparing the upgrade, which shouldn't cause much trouble

Much quicker than I thought. Since only 58 out of 248 are working I decided to go for it.

#13

Updated by openqa_review over 3 years ago

Due date set to 2021-12-29

Setting due date based on mean cycle time of SUSE QE Tools

#14

Updated by livdywan over 3 years ago

Related to action #104025: Grafana: grenache-1: partitions usage (%) alert added

#15

Updated by livdywan over 3 years ago

Related to action #104016: Broken VirtualBox kernel module on x86_64 OSD workers added

#16

Updated by mkittler over 3 years ago

It looks like openqaworker3 remained unaffected (is still on Leap 15.2).

Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.

#17

Updated by livdywan over 3 years ago

mkittler wrote:

It looks like openqaworker3 remained unaffected (is still on Leap 15.2).

Note that #104016 was simply caused by a missing reboot. It would make sense to reboot the machines shortly after doing the upgrade.

Ack. I was apparently mistaken that the workers would reboot daily anyway, that's why I thought I could avoid an extra reboot

#18

Updated by livdywan over 3 years ago

As suggested by @mkittler I'm now using sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' to check if any machines still need a reboot (unlike sudo rebootmgrctl status it doesn't require privilege escalation, and doesn't need me to check the running kernel)

So the next step is checking the remaining workers which show up as "Reboot is suggested".
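
To avoid scanning a lot of salt output by eye, the host names could be filtered out with a small helper. This is a rough sketch: hosts_needing_reboot is a made-up name, and it assumes salt's default text output where each minion ID is printed on an unindented line ending in ":" followed by indented command output.

```shell
# Sketch only: assumes salt's default output layout ("minion:" header, indented output below).
# Reads that output on stdin and prints each minion whose output contains "Reboot is suggested".
hosts_needing_reboot() {
    awk '
        # An unindented line ending in ":" starts the output of a new minion.
        /^[^ ].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Report each minion at most once.
        /Reboot is suggested/ && host != "" { print host; host = "" }
    '
}
```

It would be used as sudo salt -C 'G@roles:worker' cmd.run 'needs-restarting --reboothint' | hosts_needing_reboot.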

#19

Updated by okurz over 3 years ago

Blocked by action #104142: osd-deployment pipeline failed: File ... not found on medium added

#20

Updated by livdywan over 3 years ago

There were issues with 3 workers that I was still investigating (because, well, in a lot of output it wasn't immediately clear). I will update this comment from my notes in a minute.

openqaworker-arm-1.suse.de

Rebooted twice since it was still flagged as Reboot is suggested

openqaworker-arm-2.suse.de

Rebooted, and now looking to run new packages

openqaworker3.suse.de

I couldn't identify any issues and solely re-ran dup and rebooted 🤷️

According to salt the machine is still running 15.2 despite my having completed the upgrade and ssh is unresponsive 🧐️

sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || systemctl restart sshd || systemctl status sshd'
Dec 17 13:49:22 openqaworker3 sshd-gen-keys-start[7203]: /usr/sbin/sshd-gen-keys-start: line 7: ssh-keygen: command not found

sudo salt -C 'G@roles:worker' cmd.run 'systemctl is-active --quiet sshd || zypper in -n openssh-common'
No update candidate for 'openssh-common-8.4p1-3.3.1.x86_64'. The highest available version is already installed.

So somehow ssh-keygen is unaccounted for 🤔️

Apparently we had inconsistencies because not all repos were auto-refreshing: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631

malbec.arch.suse.de
grenache-1.qa.suse.de

Rebooted, and now looking to run new packages

grenache also needed a little kick in the vine:

> systemctl --failed
os-autoinst-openvswitch.service loaded failed failed os-autoinst openvswitch helper
> sudo systemctl restart os-autoinst-openvswitch.service

Should document in the wiki to check if zypper reports any orphans after upgrades (#104142) i.e. zypper packages --orphaned
Consider re-installing via salt after dup and other clean-ups, i.e. salt state.apply, so that we're in a state with all required packages but w/o manually installed or left-over packages that aren't in salt
~~Use -q to avoid getting lost in logs~~ -q is much too quiet 🤐️
Perform the upgrade with --no-refresh to avoid race conditions after prior package refresh and download
I didn't cover any inactive machines here

#21

Updated by okurz over 3 years ago

On openqaworker3 one problem was that zypper dup originally did not follow through, so something was missed and we ended up with a non-operative sshd after reboot.

The proper fix is in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/631 to make the SUSE_CA repo consistently auto-refresh, same as others.

We started with looking into rpm files and did salt '*' cmd.run 'test -e /etc/salt/minion.rpmnew && mv /etc/salt/minion{,.bak} && mv /etc/salt/minion{.rpmnew,}' && salt '*' state.sls_id /etc/salt/minion salt.minion
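
The same swap could be written as a small helper that works for any config file. This is only a sketch, and promote_rpmnew is a made-up name; it avoids the brace expansion used above, which is not available in every /bin/sh.

```shell
# Sketch only: replace a config file with its .rpmnew sibling, keeping the old file as .bak.
promote_rpmnew() {
    conf=$1
    # Nothing to do if the package did not leave a .rpmnew file behind.
    [ -e "$conf.rpmnew" ] || return 0
    mv "$conf" "$conf.bak" && mv "$conf.rpmnew" "$conf"
}
```

For example, promote_rpmnew /etc/salt/minion followed by re-applying the salt state for that file.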

#22

Updated by okurz over 3 years ago

File diff_openqaworker3_after_leap15.3_upgrade.diff added

Attached the complete /etc file diff from openqaworker3 (except for /etc/salt/minion which we already covered), generated with salt 'openqaworker3.suse.de' cmd.run 'rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep -v \# ; done' > diff_openqaworker3_after_leap15.3_upgrade.diff
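
The one-liner can be hard to read, so here is a sketch of the same idea as a function. The name diff_rpm_leftovers and the optional list-file argument are made up for illustration; the filter is the grep expression used for the monitor host later in this ticket, which drops comment lines as well as hunk headers.

```shell
# Sketch only: print filtered unified diffs between each config file and its leftover sibling.
# $1: file listing the leftover files, one per line (defaults to /var/adm/rpmconfigcheck,
#     which is the file read by the one-liner above).
diff_rpm_leftovers() {
    while IFS= read -r leftover; do
        [ -n "$leftover" ] || continue
        # Strip the ".rpm*" suffix (e.g. .rpmnew) to get the path of the active config file.
        active=${leftover%.rpm*}
        # Keep only context/added/removed lines that are not "#" or ";" comments.
        # "|| true" because grep exits non-zero when nothing survives the filter.
        diff -Naur "$active" "$leftover" | grep '^[ +-][^#;]' || true
    done < "${1:-/var/adm/rpmconfigcheck}"
}
```

Run rpmconfigcheck first so that the list file is up to date, then redirect the output of diff_rpm_leftovers into a file for attaching.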

see https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/750090#L8454

#23

Updated by okurz over 3 years ago

Fix for kvm udev rules deployment: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/632

#24

Updated by livdywan over 3 years ago

We seem to have extra packages on openqaworker3:

> sudo salt -C 'G@roles:worker' cmd.run 'zypper packages --orphaned'
openqaworker3.suse.de:
i | @System    | libply-boot-client4     | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply-splash-core4     | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply-splash-graphics4 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libply4                 | 0.9.4+git20190304.ed9f201-lp152.4.4 | x86_64
i | @System    | libyui-ncurses11        | 2.54.5-lp152.1.3                    | x86_64
i | @System    | libyui11                | 3.9.3-lp152.1.3                     | x86_64
> sudo salt -C 'G@roles:worker' cmd.run 'zypper rm -y -u $(zypper packages --orphaned | awk "/^i/{ print $5 }" ORS=" ") hello'
> Installation has completed with error

That last message doesn't seem to contradict the fact that the packages are gone 🤓️
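
One thing to watch in the removal command above: as far as I can tell, the $5 sits inside double quotes and so is expanded by the shell before awk runs, which means awk prints whole table rows rather than just the package name. A sketch of a more explicit extraction follows; orphaned_package_names is a made-up name, and it assumes the column layout shown in the output above (status, repository, name, version, arch separated by "|").

```shell
# Sketch only: read the table printed by `zypper packages --orphaned` on stdin and
# print the names of installed packages on a single line, separated by spaces.
orphaned_package_names() {
    awk -F'|' '
        /^i/ {
            name = $3
            gsub(/^ +| +$/, "", name)   # trim the padding around the Name column
            printf "%s%s", sep, name
            sep = " "
        }
        END { if (sep != "") print "" }
    '
}
```

The result could then be reviewed before passing it to zypper rm -u, instead of feeding it in directly.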

#25

Updated by livdywan over 3 years ago

Status changed from In Progress to Feedback

All workers seem to be in a consistent state, so I would consider this done and maybe just leave this ticket as a reference w/o trying to generalize it for now (and consider that next time)

#26

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

A related issue that came up recently: https://bugzilla.suse.com/show_bug.cgi?id=1192126

I added more steps and hints to the upgrade section on our wiki with https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&version=138&version_from=137&commit=View+differences . This should suffice then.

I resolved #104142 so we can resolve this one as well now.

#27

Updated by okurz over 3 years ago

Related to action #104077: backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M added

#28

Updated by livdywan over 3 years ago

Status changed from Resolved to In Progress

okurz wrote:

AC2: openqa-monitor runs openSUSE Leap 15.3

I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the dup call:

( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed

#29

Updated by livdywan over 3 years ago

File diff_mosd_after_leap15.3_upgrade.diff added

cdywan wrote:

okurz wrote:

AC2: openqa-monitor runs openSUSE Leap 15.3

I need to re-open the ticket. There are still outstanding issues with the monitor host which need to be addressed. This is the dup call:

( 286/1428) Removing kernel-default-5.3.18-lp152.106.1.x86_64 ...................................................................[error]
Removal of (98322)kernel-default-5.3.18-lp152.106.1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.AraSWV: line 1: /usr/lib/module-init-tools/kernel-scriptlets/rpm-preun: No such file or directory
error: %preun(kernel-default-5.3.18-lp152.106.1.x86_64) scriptlet failed, exit status 127
error: kernel-default-5.3.18-lp152.106.1.x86_64: erase failed

Apparently a re-run went through:

Executing %posttrans script 'kernel-firmware-amdgpu-20210208-2.4.noarch.rpm' ..................................<70%>=================[/]
Output of dmraid-1.0.0.rc16-3.26.x86_64.rpm %posttrans script:
    Updating /etc/sysconfig/dmraid ...

Output of apache2-2.4.43-3.32.1.x86_64.rpm %posttrans script:
    Restarting apache (all instances)

Executing %posttrans scripts .....................................................................................................[done]
Update notifications were received from the following packages:
influxdb-1.7.8-bp153.1.80.x86_64 (/var/adm/update-messages/influxdb-1.7.8-bp153.1.80)
View the notifications now? [y/n] (n): n
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
 
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.

I don't know what "View the notifications" refers to, and since zypper ran non-interactively I had no way of opting in here.

Result of rpmconfigcheck && for i in $(cat /var/adm/rpmconfigcheck) ; do diff -Naur ${i%.rpm*} $i | grep "^[ +-][^#;]" ; done > diff_mosd_after_leap15.3_upgrade.diff attached.

reboot
[...]
grep VERSION=\" /etc/os-release; systemctl --failed
VERSION="15.3"
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed

#30

Updated by livdywan over 3 years ago

Status changed from In Progress to Feedback

Grafana seems to look fine

#31

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

looks good, resolving as discussed together with cdywan

#32

Updated by okurz over 3 years ago

Due date deleted (~~2021-12-29~~)
