action #111863

openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

Upgrade o3 workers to openSUSE Leap 15.4 size:M

Added by okurz 3 months ago. Updated 27 days ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.3 and have a consistent environment

Acceptance criteria

  • AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.4 (no failed systemd services, no left over .rpm-new files, etc.)
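A minimal sketch of how the "no leftover files" part of AC1 could be checked by hand. The function name find_rpm_leftovers is made up for illustration, and it assumes leftovers use the usual .rpmnew/.rpmsave/.rpmorig suffixes; rpmconfigcheck, as used later in this ticket, covers the same ground on a real worker.

```shell
# List leftover package-manager config files below the given directory.
# DIR defaults to /etc. The function name is illustrative, not an existing tool.
find_rpm_leftovers() {
    dir="${1:-/etc}"
    find "$dir" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Example use on a worker (read-only checks):
#   systemctl --failed --no-legend   # should print nothing
#   find_rpm_leftovers /etc          # should print nothing
```

Both checks only read state, so they are safe to run before and after the upgrade to compare.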

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the workers are only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Use the instructions from above but use transactional-update shell for transactional update workers
  • After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)

Related issues

Related to openQA Project - action #111992: Deal with QEMU and OVMF default resolution being 1280x800, affecting (at least) qxl size:M (Blocked, 2022-06-03)

Copied from openQA Infrastructure - action #99189: Upgrade o3 workers to openSUSE Leap 15.3 size:M (Resolved)

Copied to openQA Infrastructure - action #113662: Ensure upgrade of vanilla Leap 15.3 os-autoinst can be automatically upgraded to Leap 15.4 size:S (Resolved)

History

#1 Updated by okurz 3 months ago

  • Copied from action #99189: Upgrade o3 workers to openSUSE Leap 15.3 size:M added

#2 Updated by okurz 3 months ago

  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.3 size:M to Upgrade o3 workers to openSUSE Leap 15.4 size:M
  • Description updated (diff)
  • Assignee deleted (mkittler)
  • Target version changed from Ready to future

#3 Updated by okurz 3 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.4 size:M to Upgrade o3 workers to openSUSE Leap 15.4

#4 Updated by okurz about 1 month ago

  • Related to action #111992: Deal with QEMU and OVMF default resolution being 1280x800, affecting (at least) qxl size:M added

#5 Updated by okurz about 1 month ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

#6 Updated by tinita about 1 month ago

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to tinita

#111992 is resolved, so this is workable

#7 Updated by tinita about 1 month ago

  • Description updated (diff)

#8 Updated by openqa_review about 1 month ago

  • Due date set to 2022-07-28

Setting due date based on mean cycle time of SUSE QE Tools

#9 Updated by mkittler about 1 month ago

  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.4 to Upgrade o3 workers to openSUSE Leap 15.4 size:M

#10 Updated by tinita about 1 month ago

I tried to upgrade openqaworker7 but got a problem:

Problem: the to be installed os-autoinst-devel-4.6.1657781352.4f16c4d-lp154.1314.2.x86_64 requires 'pkgconfig(opencv)', but this requirement cannot be provided
  deleted providers: opencv-devel-3.3.1-6.6.1.x86_64
not installable providers: opencv3-devel-3.4.16-150400.1.9.x86_64[repo-oss]
 Solution 1: Following actions will be done:
  keep obsolete opencv-devel-3.3.1-6.6.1.x86_64
  keep obsolete opencv-3.3.1-6.6.1.x86_64
 Solution 2: deinstallation of opencv-devel-3.3.1-6.6.1.x86_64
 Solution 3: deinstallation of os-autoinst-devel-4.6.1657727584.884de90-lp153.1313.1.x86_64
 Solution 4: keep obsolete opencv-devel-3.3.1-6.6.1.x86_64
 Solution 5: break os-autoinst-devel-4.6.1657781352.4f16c4d-lp154.1314.2.x86_64 by ignoring some of its dependencies

Not sure what to do.

#11 Updated by okurz about 1 month ago

According to https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/opencv we had opencv3 in Leap 15.3, while in https://build.opensuse.org/package/show/openSUSE:Leap:15.4:Update/opencv we have opencv4, same as in Tumbleweed. In https://github.com/os-autoinst/os-autoinst/blob/master/dist/rpm/os-autoinst.spec line 28 and following we have:

%if 0%{?suse_version} > 1500
# openSUSE Tumbleweed
%define opencv_require pkgconfig(opencv4)
%else
%define opencv_require pkgconfig(opencv)
%endif

so likely we need to update this according to https://en.opensuse.org/openSUSE:Build_Service_cross_distribution_howto#Detect_a_distribution_flavor_for_special_code to make sure users upgrading from Leap 15.3 to 15.4 have an automatic resolution.

I suggest reproducing this in a clean container environment, e.g. podman run --rm -it registry.opensuse.org/opensuse/leap:15.3, and seeing how we can upgrade. At best we would also provide a maintenance update to the vanilla version of os-autoinst. I will create a new ticket for that specifically.

#12 Updated by okurz about 1 month ago

  • Copied to action #113662: Ensure upgrade of vanilla Leap 15.3 os-autoinst can be automatically upgraded to Leap 15.4 size:S added

#13 Updated by cdywan about 1 month ago

#14 Updated by favogt about 1 month ago

I wonder what os-autoinst-devel is actually needed for on workers. On ow7 it can apparently be removed without breaking any dependencies.

ow7 was also installed with 15.4 already (not upgraded) and has opencv3-devel installed to provide pkgconfig(opencv). So to get the same setup on ow4, solution 2 would be correct.

If switching os-autoinst on 15.4 to OpenCV 4 is ok, then that is probably the best option. That shouldn't require any further changes; during the upgrade it would simply stay with opencv-devel, upgrading from 3 to 4.

#15 Updated by tinita about 1 month ago

Blocked by #113662

#16 Updated by okurz about 1 month ago

tinita, this ticket is not blocked by #113662, which is only about the vanilla version of os-autoinst, i.e. not the one from devel:openQA. On o3 workers we run from devel:openQA, so we can likely expect the same fix but have it available much sooner, with no need for a maintenance update.

favogt wrote:

I wonder what os-autoinst-devel is actually needed for on workers. On ow7 it can apparently be removed without breaking any dependencies.

You are right, os-autoinst-devel is not needed for the operation of workers, just necessary to build our own. Why we have that on workers I don't know, maybe it was done for experiments, maybe for the external video encoder?

If I uninstalled it together with its dependencies, it would look like this:

okurz@ariel:~> for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "zypper rm --clean-deps --dry-run os-autoinst-devel"; done
## aarch64
Reading installed packages...
Resolving package dependencies...

The following 102 packages are going to be REMOVED:
  aspell aspell-en aspell-spell cmake cmake-full cmake-man fftw3-devel gcc7-c++ gcc-c++ glu-devel kbproto-devel libenca0 libfftw3_threads3 libglvnd-devel libguess1 libICE-devel libjsoncpp19 libogg-devel libpng16-compat-devel libpng16-devel librcc0 librcd0 librhash0 libSM-devel libsndfile-devel libstdc++6-devel-gcc7 libstdc++-devel libtheora0 libtheora-devel libX11-devel libXau-devel libxcb-composite0 libxcb-damage0 libxcb-devel libxcb-dpms0 libxcb-record0 libxcb-res0 libxcb-screensaver0 libxcb-xf86dri0 libxcb-xtest0 libxcb-xv0 libxcb-xvmc0 libXext-devel Mesa-KHR-devel Mesa-libGL-devel ninja opencv opencv-devel os-autoinst-devel perl-Algorithm-Diff perl-AppConfig perl-B-Debug perl-Browser-Open perl-Class-Accessor-Lite perl-Code-TidyAll perl-Config-INI perl-Devel-Cover perl-Devel-Cover-Report-Codecov perl-Export-Attrs perl-Exporter-Lite perl-File-pushd perl-Furl perl-HTTP-Parser-XS perl-IPC-Run3 perl-JSON-MaybeXS perl-List-Compare perl-List-SomeUtils perl-List-SomeUtils-XS perl-Log-Any perl-Mixin-Linewise perl-Module-Find perl-Net-IDN-Encode perl-PerlIO-utf8_strict perl-Scope-Guard perl-Sereal-Decoder perl-Sereal-Encoder perl-Specio-Library-Path-Tiny perl-Sub-Retry perl-Sub-Uplevel perl-SUPER perl-Template-Toolkit perl-Test-Differences perl-Test-Exception perl-Test-MockModule perl-Test-MockObject perl-Test-MockRandom perl-Test-Mock-Time perl-Test-Most perl-Test-Output perl-Test-Strict perl-Test-Warn perl-Test-Warnings perl-Text-Diff perl-Time-Duration-Parse perl-UNIVERSAL-can perl-UNIVERSAL-isa pthread-stubs-devel rcc-runtime ShellCheck xextproto-devel xproto-devel zlib-devel
…
## power8
Reading installed packages...
Package 'os-autoinst-devel' not found.Resolving package dependencies...

Nothing to do.

So removing os-autoinst-devel and carefully removing unneeded dependencies is also fine for me.

#17 Updated by okurz about 1 month ago

osukup please observe https://github.com/os-autoinst/os-autoinst/pull/2125#issuecomment-1185632899

Testing an according maintenance request with podman run --rm -it registry.opensuse.org/opensuse/leap:15.3 /bin/sh -c 'zypper -n ar -f https://download.opensuse.org/repositories/home:/okurz:/branches:/openSUSE:/Backports:/SLE-15-SP4:/Update/standard/home:okurz:branches:openSUSE:Backports:SLE-15-SP4:Update.repo && zypper -n --gpg-auto-import-keys ref && zypper -n up && zypper -n in os-autoinst-devel && zypper --releasever=15.4 -n dup'

I ran into

Problem: nothing provides 'pkgconfig(opencv4)' needed by the to be installed os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64
 Solution 1: do not install os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64
 Solution 2: break os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64 by ignoring some of its dependencies

so seems like something is still missing?

EDIT: As commented on https://build.opensuse.org/request/show/989570#comment-1653437 we just need to accept the conflict and resolve manually whenever hit.

#18 Updated by tinita 27 days ago

Upgraded aarch64, openqaworker1, openqaworker7, power8, imagetester and rebel.

Notes:

/etc/zypp/zypp.conf
Suggestion was to add:
    multiversion = provides:multiversion(kernel)
but we kept the empty value

/etc/postfix/main.cf
    inet_interfaces = all
    inet_protocols = ipv4
    masquerade_exceptions =
vs. rpmsave
    inet_interfaces = localhost
    inet_protocols = all
    masquerade_exceptions = root

We kept the automatic changes

On openqaworker4 we changed:
/etc/zypp/zypp.conf
old:
solver.dupAllowVendorChange = true
new:
solver.dupAllowVendorChange = true


/etc/containers/storage.conf
# Default Storage Driver
driver = "btrfs"
rpmnew suggests:
driver = "overlay"

we kept the old value

#19 Updated by mkittler 27 days ago

I'm adding noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m as config for the NFS mount on all o3 workers (in accordance with OSD workers), because the mount unit failed after rebooting aarch64 after the upgrade (it couldn't resolve the hostname). I excluded rebel as it has no NFS mount anyway, and openqaworker7 as it couldn't boot at all.
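For reference, a small sketch of what such an /etc/fstab entry could look like with the mount options quoted above. The helper nfs_fstab_line is made up for illustration, and the export and mount point in the example call are hypothetical placeholders, not the real o3 values.

```shell
# Print an /etc/fstab line for an NFS share using the mount options from the comment above.
# The export and mount point are passed in. The values in the example call
# below are hypothetical placeholders, not taken from the ticket.
nfs_fstab_line() {
    source_export="$1"
    mount_point="$2"
    opts='noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m'
    printf '%s %s nfs %s 0 0\n' "$source_export" "$mount_point" "$opts"
}

nfs_fstab_line 'nfs.example.org:/export/share' '/mnt/share'
```

With noauto plus x-systemd.automount the share is only mounted on first access, and nofail keeps a failed mount from blocking the boot, so a hostname that cannot be resolved early on should no longer produce a failed mount unit.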

#20 Updated by tinita 27 days ago

Leftovers:

power8

power8:~ # systemctl --failed
  UNIT                                                                                               LOAD   ACTIVE SUB    DESCRIPTION                                                                            >
● systemd-hibernate-resume@dev-disk-by\x2did-scsi\x2d1IBM_IPR\x2d0_5DB3D30000000020\x2dpart2.service loaded failed failed Resume from hibernation using device /dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-p>

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

Find out about this failure.

openqaworker7

does not boot

#21 Updated by okurz 27 days ago

tinita wrote:

Leftovers:

power8

power8:~ # systemctl --failed
  UNIT                                                                                               LOAD   ACTIVE SUB    DESCRIPTION                                                                            >
● systemd-hibernate-resume@dev-disk-by\x2did-scsi\x2d1IBM_IPR\x2d0_5DB3D30000000020\x2dpart2.service loaded failed failed Resume from hibernation using device /dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-p>

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

Find out about this failure.

It is not a new problem but has been seen for a very long time. It's just that we couldn't completely mask this service so far.

EDIT: I tried again with masking the service and triggering a reboot. Did not work. I removed all mentions of "resume=UUID…" in /etc and /boot. Rebooted, no failed service. That seems to have helped.
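A rough sketch of how the remaining "resume=" mentions could be located before editing anything. The function name list_resume_references is made up for illustration, and which files turn up depends on the bootloader setup of the machine.

```shell
# Print every file below the given directories that mentions a resume= kernel parameter.
# Binary files are skipped (-I) and unreadable paths are silently ignored.
list_resume_references() {
    grep -rIl -e 'resume=' "$@" 2>/dev/null | sort
}

# Example use on a worker (read-only):
#   list_resume_references /etc /boot
```

The function only lists files. Removing the parameter and regenerating the bootloader config remains a manual step.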

openqaworker7

does not boot

It does boot, it just takes about 10 minutes due to various firmware initialization, a PXE boot menu with timeout, then falling back to some other init until it reaches the grub menu, where there is again a timeout until it boots the OS.

#22 Updated by mkittler 27 days ago

openqaworker7 is meanwhile reachable via SSH, the only failed service is snapper-cleanup.service

#23 Updated by okurz 27 days ago

  • Due date deleted (2022-07-28)
  • Status changed from In Progress to Resolved

mkittler wrote:

openqaworker7 is meanwhile reachable via SSH, the only failed service is snapper-cleanup.service

openqaworker7 has an excessive amount of btrfs snapshots, for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "snapper ls | wc -l"; done yields:

## aarch64
26
## openqaworker1
23
## openqaworker4
20
## openqaworker7
1390
## imagetester
19
## rebel
23
## power8
The config 'root' does not exist. Likely snapper is not configured.
See 'man snapper' for further instructions.
0
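To spot outliers in output like the above without eyeballing it, something along these lines could work. The function name hosts_over_limit and the default limit of 100 are arbitrary choices for illustration. Note that "snapper ls | wc -l" also counts header lines, so the numbers are slightly higher than the real snapshot count.

```shell
# Read "## host" / count pairs on stdin and print hosts whose count exceeds LIMIT.
# Lines that are neither a "## host" header nor a plain number are ignored.
hosts_over_limit() {
    limit="${1:-100}"
    awk -v limit="$limit" '
        /^## /     { host = $2; next }                          # remember the current host
        /^[0-9]+$/ { if ($1 + 0 > limit + 0) print host, $1 }   # report counts above the limit
    '
}

# Example use: pipe the output of the ssh loop above into the function:
#   ... | hosts_over_limit 100
```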

I think we had this problem already some time ago related to podman and btrfs subvolumes occupied and preventing a proper cleanup. Yep, there was #102942 and #65073, in both cases related to some nested subvolumes that snapper fails/refuses to delete. So manual cleanup it is:

btrfs subvolume list / | grep subvolumes | sed -n 's/^.*path @//p' | xargs btrfs subvolume delete && systemctl start snapper-cleanup
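Since the one-liner above deletes subvolumes right away, it may help to first look at what its selection part produces. Below is a sketch that wraps only the grep/sed stage in a function. The name nested_subvolume_paths is made up for illustration.

```shell
# Read `btrfs subvolume list /` output on stdin and print the paths
# (relative to the @ subvolume) of entries that contain "subvolumes",
# i.e. the same selection the cleanup one-liner above feeds into xargs.
nested_subvolume_paths() {
    grep subvolumes | sed -n 's/^.*path @//p'
}

# Dry run on a worker: show what would be deleted without deleting anything.
#   btrfs subvolume list / | nested_subvolume_paths
```

Once the list looks right, the same pipeline can be handed to xargs btrfs subvolume delete as in the original command.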

This is slowly churning down the number of snapshots but will take some time. Anyway, overall we seem to be good:

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "systemctl is-system-running; rpmconfigcheck"; done
