action #111863
closed openQA Project (public) - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4
Upgrade o3 workers to openSUSE Leap 15.4 size:M
Added by okurz over 2 years ago. Updated over 2 years ago.
Description
Motivation
- Need to upgrade workers before EOL of Leap 15.3 and have a consistent environment
Acceptance criteria
- AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.4 (no failed systemd services, no leftover .rpmnew files, etc.)
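One way to check the "no leftover files" part of AC1 could look like the sketch below. It is only an illustration: the function names are hypothetical, and it assumes leftovers carry the usual RPM suffixes (.rpmnew, .rpmsave, .rpmorig) and live below /etc unless another directory is given.

```shell
#!/bin/sh
# List leftover RPM config files below a directory (default: /etc).
# Assumption: leftovers use the suffixes .rpmnew, .rpmsave or .rpmorig.
# The function names are hypothetical helpers, not part of any package.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Succeed only if no leftover files are found below the directory.
no_rpm_leftovers() {
    [ -z "$(find_rpm_leftovers "$1")" ]
}
```

On a worker this could be combined with systemctl --failed to cover the "no failed systemd services" part, e.g. by running no_rpm_leftovers /etc after the reboot.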
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Use the instructions from above but use transactional-update shell for transactional-update workers
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
Updated by okurz over 2 years ago
- Copied from action #99189: Upgrade o3 workers to openSUSE Leap 15.3 size:M added
Updated by okurz over 2 years ago
- Subject changed from Upgrade o3 workers to openSUSE Leap 15.3 size:M to Upgrade o3 workers to openSUSE Leap 15.4 size:M
- Description updated (diff)
- Assignee deleted (mkittler)
- Target version changed from Ready to future
Updated by okurz over 2 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Subject changed from Upgrade o3 workers to openSUSE Leap 15.4 size:M to Upgrade o3 workers to openSUSE Leap 15.4
Updated by okurz over 2 years ago
- Related to action #111992: Deal with QEMU and OVMF default resolution being 1280x800, affecting (at least) qxl size:M added
Updated by okurz over 2 years ago
- Status changed from New to Blocked
- Assignee set to okurz
- Target version changed from future to Ready
Updated by tinita over 2 years ago
- Status changed from Blocked to In Progress
- Assignee changed from okurz to tinita
#111992 Resolved so this is workable
Updated by openqa_review over 2 years ago
- Due date set to 2022-07-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 2 years ago
- Subject changed from Upgrade o3 workers to openSUSE Leap 15.4 to Upgrade o3 workers to openSUSE Leap 15.4 size:M
Updated by tinita over 2 years ago
I tried to upgrade openqaworker7 but got a problem:
Problem: the to be installed os-autoinst-devel-4.6.1657781352.4f16c4d-lp154.1314.2.x86_64 requires 'pkgconfig(opencv)', but this requirement cannot be provided
deleted providers: opencv-devel-3.3.1-6.6.1.x86_64
not installable providers: opencv3-devel-3.4.16-150400.1.9.x86_64[repo-oss]
Solution 1: Following actions will be done:
keep obsolete opencv-devel-3.3.1-6.6.1.x86_64
keep obsolete opencv-3.3.1-6.6.1.x86_64
Solution 2: deinstallation of opencv-devel-3.3.1-6.6.1.x86_64
Solution 3: deinstallation of os-autoinst-devel-4.6.1657727584.884de90-lp153.1313.1.x86_64
Solution 4: keep obsolete opencv-devel-3.3.1-6.6.1.x86_64
Solution 5: break os-autoinst-devel-4.6.1657781352.4f16c4d-lp154.1314.2.x86_64 by ignoring some of its dependencies
Not sure what to do.
Updated by okurz over 2 years ago
According to https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/opencv we had opencv3 in Leap15.3, in https://build.opensuse.org/package/show/openSUSE:Leap:15.4:Update/opencv we have opencv4, same as in Tumbleweed. In https://github.com/os-autoinst/os-autoinst/blob/master/dist/rpm/os-autoinst.spec line 28 and following we have:
%if 0%{?suse_version} > 1500
# openSUSE Tumbleweed
%define opencv_require pkgconfig(opencv4)
%else
%define opencv_require pkgconfig(opencv)
%endif
so likely we need to update this according to https://en.opensuse.org/openSUSE:Build_Service_cross_distribution_howto#Detect_a_distribution_flavor_for_special_code to make sure users upgrading from Leap 15.3 to 15.4 have an automatic resolution.
I suggest to reproduce in a clean container environment, e.g. podman run --rm -it registry.opensuse.org/opensuse/leap:15.3, and see how we can upgrade. At best we would also provide a maintenance update to the vanilla version in os-autoinst. I will create a new ticket for that specifically.
Updated by okurz over 2 years ago
- Copied to action #113662: Ensure upgrade of vanilla Leap 15.3 os-autoinst can be automatically upgraded to Leap 15.4 size:S added
Updated by livdywan over 2 years ago
- Ondřej suggests adding a version requirement for opencv3-devel based on the distribution version, which was called opencv-devel on older releases, in the os-autoinst spec file.
- Maybe it's enough to change %if 0%{?suse_version} > 1500 to use a higher version? See https://github.com/os-autoinst/os-autoinst/blob/master/dist/rpm/os-autoinst.spec
Updated by favogt over 2 years ago
I wonder what os-autoinst-devel is actually needed for on workers. On ow7 it can apparently be removed without breaking any dependencies.
ow7 was also installed with 15.4 already (not upgraded) and has opencv3-devel installed to provide pkgconfig(opencv). So to get the same setup on ow4, solution 2 would be correct.
If switching os-autoinst on 15.4 to OpenCV 4 is ok, then that is probably the best option. That shouldn't require any further changes; during the upgrade it would simply stay with opencv-devel, upgrading from 3 to 4.
Updated by okurz over 2 years ago
@tinita this ticket is not blocked by #113662 which is only about the vanilla version of os-autoinst, i.e. not from devel:openQA. On o3 workers we run from devel:openQA so we can expect likely the same fix but have it available much sooner and no need for a maintenance update.
favogt wrote:
I wonder what os-autoinst-devel is actually needed for on workers. On ow7 it can apparently be removed without breaking any dependencies.
You are right, os-autoinst-devel is not needed for the operation of workers, just necessary to build our own. Why we have that on workers I don't know; maybe it was done for experiments, maybe for the external video encoder?
If I uninstalled it with its dependencies removed as well, it would look like this:
okurz@ariel:~> for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "zypper rm --clean-deps --dry-run os-autoinst-devel"; done
## aarch64
Reading installed packages...
Resolving package dependencies...
The following 102 packages are going to be REMOVED:
aspell aspell-en aspell-spell cmake cmake-full cmake-man fftw3-devel gcc7-c++ gcc-c++ glu-devel kbproto-devel libenca0 libfftw3_threads3 libglvnd-devel libguess1 libICE-devel libjsoncpp19 libogg-devel libpng16-compat-devel libpng16-devel librcc0 librcd0 librhash0 libSM-devel libsndfile-devel libstdc++6-devel-gcc7 libstdc++-devel libtheora0 libtheora-devel libX11-devel libXau-devel libxcb-composite0 libxcb-damage0 libxcb-devel libxcb-dpms0 libxcb-record0 libxcb-res0 libxcb-screensaver0 libxcb-xf86dri0 libxcb-xtest0 libxcb-xv0 libxcb-xvmc0 libXext-devel Mesa-KHR-devel Mesa-libGL-devel ninja opencv opencv-devel os-autoinst-devel perl-Algorithm-Diff perl-AppConfig perl-B-Debug perl-Browser-Open perl-Class-Accessor-Lite perl-Code-TidyAll perl-Config-INI perl-Devel-Cover perl-Devel-Cover-Report-Codecov perl-Export-Attrs perl-Exporter-Lite perl-File-pushd perl-Furl perl-HTTP-Parser-XS perl-IPC-Run3 perl-JSON-MaybeXS perl-List-Compare perl-List-SomeUtils perl-List-SomeUtils-XS perl-Log-Any perl-Mixin-Linewise perl-Module-Find perl-Net-IDN-Encode perl-PerlIO-utf8_strict perl-Scope-Guard perl-Sereal-Decoder perl-Sereal-Encoder perl-Specio-Library-Path-Tiny perl-Sub-Retry perl-Sub-Uplevel perl-SUPER perl-Template-Toolkit perl-Test-Differences perl-Test-Exception perl-Test-MockModule perl-Test-MockObject perl-Test-MockRandom perl-Test-Mock-Time perl-Test-Most perl-Test-Output perl-Test-Strict perl-Test-Warn perl-Test-Warnings perl-Text-Diff perl-Time-Duration-Parse perl-UNIVERSAL-can perl-UNIVERSAL-isa pthread-stubs-devel rcc-runtime ShellCheck xextproto-devel xproto-devel zlib-devel
…
## power8
Reading installed packages...
Package 'os-autoinst-devel' not found.
Resolving package dependencies...
Nothing to do.
So removing os-autoinst-devel and carefully removing unneeded dependencies is also fine for me.
Updated by okurz over 2 years ago
@osukup please observe https://github.com/os-autoinst/os-autoinst/pull/2125#issuecomment-1185632899
Testing an according maintenance request with podman run --rm -it registry.opensuse.org/opensuse/leap:15.3 /bin/sh -c 'zypper -n ar -f https://download.opensuse.org/repositories/home:/okurz:/branches:/openSUSE:/Backports:/SLE-15-SP4:/Update/standard/home:okurz:branches:openSUSE:Backports:SLE-15-SP4:Update.repo && zypper -n --gpg-auto-import-keys ref && zypper -n up && zypper -n in os-autoinst-devel && zypper --releasever=15.4 -n dup'
I ran into
Problem: nothing provides 'pkgconfig(opencv4)' needed by the to be installed os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64
Solution 1: do not install os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64
Solution 2: break os-autoinst-devel-4.6.1639403953.ae94c4bd-bp154.4.1.x86_64 by ignoring some of its dependencies
so seems like something is still missing?
EDIT: As commented on https://build.opensuse.org/request/show/989570#comment-1653437 we just need to accept the conflict and resolve manually whenever hit.
Updated by tinita over 2 years ago
Upgraded aarch64 openqaworker1 openqaworker7 power8 imagetester rebel
Notes:
/etc/zypp/zypp.conf
Suggestion was to add:
multiversion = provides:multiversion(kernel)
but we kept the empty value
/etc/postfix/main.cf
inet_interfaces = all
inet_protocols = ipv4
masquerade_exceptions =
vs. rpmsave
inet_interfaces = localhost
inet_protocols = all
masquerade_exceptions = root
We kept the automatic changes
On openqaworker4 we changed:
/etc/zypp/zypp.conf
old:
solver.dupAllowVendorChange = true
new:
solver.dupAllowVendorChange = true
/etc/containers/storage.conf
# Default Storage Driver
driver = "btrfs"
rpmnew suggests:
driver = "overlay"
we kept the old value
Updated by mkittler over 2 years ago
I'm adding noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m as config for the NFS mount on all o3 workers (in accordance with OSD workers) because the mount unit failed after rebooting aarch64 after the upgrade (it couldn't resolve the hostname). I excluded rebel as it has no NFS mount anyway and openqaworker7 as it couldn't boot at all.
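For illustration, the resulting /etc/fstab entry could be generated as in the sketch below. The option string is the one quoted above; the function name, the NFS source and the mount point are hypothetical placeholders, not the actual o3 values.

```shell
#!/bin/sh
# Mount options as quoted in the comment above.
NFS_OPTS='noauto,nofail,retry=30,ro,x-systemd.automount,x-systemd.device-timeout=10m,x-systemd.mount-timeout=30m'

# Print one fstab entry for an NFS share.
# Arguments: $1 = NFS source (host:/path), $2 = local mount point.
# Hypothetical helper for illustration only.
nfs_fstab_line() {
    printf '%s %s nfs %s 0 0\n' "$1" "$2" "$NFS_OPTS"
}

# Example with placeholder values (not the real o3 export or mount point):
# nfs_fstab_line 'nfs.example.org:/export/share' '/mnt/share'
```

With noauto and x-systemd.automount the share is mounted on first access rather than at boot, so a hostname that cannot be resolved yet during early boot should no longer leave a failed mount unit behind.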
Updated by tinita over 2 years ago
Leftovers:
power8
power8:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION >
● systemd-hibernate-resume@dev-disk-by\x2did-scsi\x2d1IBM_IPR\x2d0_5DB3D30000000020\x2dpart2.service loaded failed failed Resume from hibernation using device /dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-p>
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
Find out about this failure.
openqaworker7
does not boot
Updated by okurz over 2 years ago
tinita wrote:
Leftovers:
power8
power8:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION >
● systemd-hibernate-resume@dev-disk-by\x2did-scsi\x2d1IBM_IPR\x2d0_5DB3D30000000020\x2dpart2.service loaded failed failed Resume from hibernation using device /dev/disk/by-id/scsi-1IBM_IPR-0_5DB3D30000000020-p>
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
Find out about this failure.
It is not a new problem; we have been seeing it for a long time. It's just that we couldn't completely mask this service so far.
EDIT: I tried again with masking the service and triggering a reboot. Did not work. I removed all mentions of "resume=UUID…" in /etc and /boot. Rebooted, no failed service. That seems to have helped.
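A quick way to locate such mentions before editing them could be the following sketch. The function name is hypothetical, and it simply searches for the literal string resume= in the given directories; which files match and how they should be edited depends on the machine.

```shell
#!/bin/sh
# List files below the given directories that mention a "resume=" kernel
# parameter. Hypothetical helper; pass e.g. /etc and /boot as arguments.
find_resume_mentions() {
    grep -rl 'resume=' "$@" 2>/dev/null | sort
}

# Example (run as root on the affected host):
# find_resume_mentions /etc /boot
```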
openqaworker7
does not boot
It does boot; it just takes about 10 minutes because of various firmware initialization steps, a PXE boot menu with a timeout, then a fallback to some other init until it reaches the grub menu, where there is again a timeout until it boots the OS.
Updated by mkittler over 2 years ago
openqaworker7 is meanwhile reachable via SSH, the only failed service is snapper-cleanup.service
Updated by okurz over 2 years ago
- Due date deleted (2022-07-28)
- Status changed from In Progress to Resolved
mkittler wrote:
openqaworker7 is meanwhile reachable via SSH, the only failed service is snapper-cleanup.service
openqaworker7 has an excessive number of btrfs snapshots:
for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "snapper ls | wc -l"; done
yields:
## aarch64
26
## openqaworker1
23
## openqaworker4
20
## openqaworker7
1390
## imagetester
19
## rebel
23
## power8
The config 'root' does not exist. Likely snapper is not configured.
See 'man snapper' for further instructions.
0
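Output in this "## host" followed by a count format can be screened automatically, as in the sketch below. The function name and the default threshold of 100 are assumptions chosen for illustration, not values anyone in this ticket agreed on.

```shell
#!/bin/sh
# Read "## host" / count pairs on stdin and print every host whose count
# exceeds the limit. Lines that are neither a "## host" header nor a bare
# number (e.g. snapper error messages) are ignored.
# Hypothetical helper; the default limit of 100 is an arbitrary assumption.
flag_snapshot_heavy_hosts() {
    awk -v limit="${1:-100}" '
        /^## /     { host = $2; next }
        /^[0-9]+$/ { if ($1 + 0 > limit + 0) print host, $1 }
    '
}
```

The output of the loop can be piped straight into flag_snapshot_heavy_hosts 100 to list only the hosts that need a closer look.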
I think we had this problem already some time ago related to podman and btrfs subvolumes occupied and preventing a proper cleanup. Yep, there was #102942 and #65073, in both cases related to some nested subvolumes that snapper fails/refuses to delete. So manual cleanup it is:
btrfs subvolume list / | grep subvolumes | sed -n 's/^.*path @//p' | xargs btrfs subvolume delete && systemctl start snapper-cleanup
This is slowly reducing the number of snapshots and will take some time. Anyway, overall we seem to be good:
for i in aarch64 openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do echo "## $i" && ssh root@$i "systemctl is-system-running; rpmconfigcheck"; done
Updated by okurz over 1 year ago
- Copied to action #130585: Upgrade o3 workers to openSUSE Leap 15.5 added