action #157996
coordination #157969 (closed): [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6
0%
Description
Motivation
- Need to upgrade machines before EOL of Leap 15.5 and have a consistent environment
Acceptance criteria
- AC1: All LSG QE salt controlled machines except for OSD workers run a cleanly upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)
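One part of the "clean" criterion can be checked per host with a small helper. This is a minimal sketch, not an established tool: the function name `find_rpm_leftovers` is hypothetical, and it assumes the leftovers of interest are the `*.rpmnew`, `*.rpmsave` and `*.rpmorig` files that rpm writes when packaged config files conflict with local changes.

```shell
# Hypothetical helper: list leftover RPM config files below a directory.
# Defaults to /etc, where packaged configuration usually lives.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}
```

An empty result from `find_rpm_leftovers` together with an empty `systemctl --failed --no-legend` would be one way to read "clean" for a single host.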
Acceptance tests
- AT1-1:
sudo salt -C 'not G@roles:worker and not G@roles:webui' grains.get oscodename | grep -B1 'Leap 15.5'
is empty
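AT1-1 can also be expressed as a check with an exit code, which is easier to reuse in scripts. A minimal sketch: the function name `assert_no_host_on` is hypothetical, and it reads the salt output from stdin so it can be tried without a salt master.

```shell
# Hypothetical helper: succeed only if no input line mentions the old release.
# Expects the output of `salt ... grains.get oscodename` on stdin.
assert_no_host_on() {
    old_release="$1"
    # grep -B1 also prints the host line preceding each match, as in AT1-1
    if matches=$(grep -B1 -- "$old_release"); then
        printf 'Hosts still on %s:\n%s\n' "$old_release" "$matches" >&2
        return 1
    fi
    return 0
}
```

Usage would then be `sudo salt -C 'not G@roles:worker and not G@roles:webui' grains.get oscodename | assert_no_host_on 'Leap 15.5'`.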
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the related services are not heavily relied upon
- Keep the IPMI interface ready and test that Serial-over-LAN works for potential recovery; for virtual machines, make sure virt-manager access works
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with
snapper rollback
Rollback actions
- DONE Remove silence
alertname=jenkins: host up alert
from https://monitor.qa.suse.de/alerting/silences
Further details
- Don't worry, everything can be repaired :) If by any chance a machine gets misconfigured, in many cases there are btrfs snapshots to recover from, the IPMI Serial-over-LAN, etc.
Updated by okurz 9 months ago
- Copied from action #130648: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.5 added
Updated by okurz 9 months ago
- Subject changed from Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.5 to Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6
- Description updated (diff)
- Assignee deleted (okurz)
- Target version changed from Ready to future
- Start date deleted (2023-06-09)
Updated by okurz 8 months ago
- Copied to action #160089: Handle uncommented package lock on "kernel-default" and "kernel-default-base" on openqa-piworker added
Updated by okurz 8 months ago
- Description updated (diff)
sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'zypper ll'
shows that openqa-piworker has package locks on kernel-default and kernel-default-base but no comment explaining why:
openqa-piworker.qe.nue2.suse.org:
# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+--------
1 | kernel-default      | package | (any)      |
2 | kernel-default-base | package | (any)      |
asked in https://suse.slack.com/archives/C02AJ1E568M/p1715189271425109
I will leave that in place for now but upgrade all nevertheless.
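Locks without an explanation can be picked out of the `zypper ll` table mechanically. A minimal sketch, assuming the five-column layout shown above; the function name `locks_without_comment` is hypothetical.

```shell
# Hypothetical helper: print the names of locks whose Comment column is empty.
# Expects `zypper ll` output on stdin; only rows starting with a lock number
# are considered, so host, header and separator lines are skipped.
locks_without_comment() {
    awk -F'|' '
        $1 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ && $5 ~ /^[[:space:]]*$/ {
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2)  # trim the name column
            print $2
        }
    '
}
```

For example, `zypper ll | locks_without_comment` on openqa-piworker would be expected to list both kernel packages from the table above.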
sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'export new_version=15.6; zypper --releasever=$new_version ref && systemctl stop openqa-continuous-update.timer && zypper -n --releasever=$new_version dup --dry-run --auto-agree-with-licenses --replacefiles --download-in-advance'
http://download.opensuse.org/repositories/devel:/openQA:/monitoring/ is missing 15.6 as well as http://download.opensuse.org/repositories/home:/MMoese:/baremetal_support/
Enabled 15.6 in https://build.opensuse.org/projects/devel:openQA:monitoring/meta and removed EOL 15.4
sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'export new_version=15.6; zypper --releasever=$new_version ref && systemctl stop openqa-continuous-update.timer && zypper -n --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && reboot'
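The one-liner passed to cmd.run is easier to review when split into its individual steps. The following is a sketch of the same sequence for a single host; the `run` wrapper and the `DRY_RUN` switch are hypothetical additions that only print the commands unless `DRY_RUN=0` is set.

```shell
# Hypothetical wrapper: print the command in dry-run mode (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same steps as the salt one-liner: refresh repos for the new release,
# stop the continuous update timer, dist-upgrade, reboot.
upgrade_host() {
    new_version="${1:-15.6}"
    run zypper --releasever="$new_version" ref &&
    run systemctl stop openqa-continuous-update.timer &&
    run zypper -n --releasever="$new_version" dup --auto-agree-with-licenses --replacefiles --download-in-advance &&
    run reboot
}
```

Calling `upgrade_host 15.6` prints the four commands; chaining with `&&` keeps the property of the original that a failing step prevents the reboot.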
baremetal-support and backup-qam needed special handling. On backup-qam multiple additional packages would have been pulled in, so I first ran
zypper rm -u -t pattern x11 SDK-C-C++ kvm_server gnome-basic Basis-Devel
and then successfully conducted the upgrade.
jenkins is still running the zypper calls for the upgrade but is very slow to respond. It might be necessary to abort the upgrade, roll back to an older snapshot, reboot and try again.
backup-vm seems to have aborted the upgrade halfway and is now stuck with an unusable zypper:
error while loading shared libraries: libabsl_log_internal_check_op.so.2308.0.0: cannot open shared object file: No such file or directory
/var/log/salt/minion says
(314/800) Installing: libharfbuzz0-8.3.0-150600.1.2.x86_64 [....done]
(315/800) Installing: libgobject-2_0-0-2.78.3-150600.2.1.x86_64 [...done]
(316/800) Removing pkexec-121-150500.1.6.x86_64 [..
error: package pkexec-121-150500.1.6.x86_64 is not installed
error]
Removal of (109119)pkexec-121-150500.1.6.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): a
Warning: %posttrans scripts skipped while aborting:
login_defs-4.8.1-150600.15.44.noarch
systemd-presets-common-SUSE-15-150600.25.2.noarch
systemd-presets-branding-openSUSE-12.2-lp156.6.1.noarch
openssl-1_1-1.1.1w-150600.2.16.x86_64
openSUSE-release-15.6-lp156.404.1.x86_64
systemd-254.10-150600.1.9.x86_64
suse-module-tools-15.6.7-150600.1.24.x86_64
kmod-29-150600.11.3.x86_64
udev-254.10-150600.1.9.x86_64
systemd-network-254.10-150600.1.9.x86_64
systemd-coredump-254.10-150600.1.9.x86_64
mdadm-4.3-150600.1.25.x86_64
dracut-059+suse.515.g83296e6f-150600.1.14.x86_64
adobe-sourcesanspro-fonts-2.045-bp156.3.1.noarch
sg3_utils-1.48+10.1532339-150600.1.2.x86_64
shadow-4.8.1-150600.15.44.x86_64
open-iscsi-2.1.9-150600.49.7.x86_64
haveged-1.9.14-150600.9.4.x86_64
so I am reverting to an older snapshot, rebooting and trying again.
On qamaster, trying to start the jenkins VM again fails with:
Error starting domain: internal error: process exited while connecting to monitor: /usr/bin/qemu-kvm: error while loading shared libraries: libxenctrl.so.4.17: cannot open shared object file: No such file or directory
Updated by okurz 8 months ago
I found that the DHCP entry for qamaster-sp is wrong. Fixed in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5085
The following change seems problematic:
The following 6 packages are going to change vendor:
perl-DBD-Pg
3.18.0-lp155.2.2 -> 3.10.4-150600.12.2 x86_64 openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
perl-DBD-SQLite
1.740.0-lp155.4.1 -> 1.66-150300.3.9.1 x86_64 openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
perl-DBI
1.643-lp155.4.1 -> 1.642-3.9.1 x86_64 openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
perl-Mojo-SQLite
3.009-lp155.2.1 -> 3.006-bp156.3.1 noarch openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> openSUSE
perl-Mojolicious
9.360.0-lp155.2.1 -> 9.350.0-bp156.1.1 noarch openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> openSUSE
perl-Perl-Tidy
20240202.0.0-lp155.3.1 -> 20230912.0.0-bp156.1.1 noarch openSUSE-Leap-15.6-Oss obs://build.opensuse.org/devel:openQA -> openSUSE
I guess for all those we want to have an updated package available in devel:openQA:Leap:15.6. Did
for i in perl-DBD-Pg perl-DBD-SQLite perl-DBI perl-Mojo-SQLite perl-Mojolicious perl-Perl-Tidy; do osc linkpac devel:openQA:Leap:15.5 $i devel:openQA:Leap:15.6; done
and in each new package link added a comment pointing to the corresponding comment from the Leap 15.5 project:
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-DBD-Pg
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-DBD-SQLite
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-DBI
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-Mojo-SQLite
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-Mojolicious
- https://build.opensuse.org/package/show/devel:openQA:Leap:15.6/perl-Perl-Tidy
Now https://build.opensuse.org/project/show/devel:openQA:Leap:15.6 has everything necessary, at least for jenkins.qe.nue2.suse.org, which IIRC uses openQA-client. With that, jenkins.qe.nue2.suse.org no longer proposes any vendor change for openQA related packages.
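The package list for the linkpac loop can be derived from the vendor-change listing instead of being typed by hand. A minimal sketch, assuming the layout quoted above where each package name line is followed by a detail line containing "->"; the function name `vendor_change_packages` is hypothetical.

```shell
# Hypothetical helper: print package names from zypper's
# "going to change vendor" listing read from stdin.
vendor_change_packages() {
    awk '
        /->/ { if (prev != "") print prev; prev = ""; next }  # detail line: emit the name seen before it
        { prev = $1 }                                         # remember the candidate name line
    '
}
```

The output could then replace the hard-coded list, e.g. `for i in $(vendor_change_packages < dup.log); do ...; done`, where `dup.log` is a hypothetical file holding the saved dry-run output.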
Updated by okurz 8 months ago
- Copied to action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M added
Updated by okurz 8 months ago
- Copied to action #160098: After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S added
Updated by okurz 8 months ago
- Due date set to 2024-05-22
- Status changed from In Progress to Feedback
While monitoring https://openqa.suse.de/tests?resultfilter=Failed&resultfilter=Incomplete I found multiple related failures due to incomplete upgrades or missing reboots on hypervisor hosts, so I called
failed_since="2024-05-08 17:00Z" result="result='failed'" host=openqa.suse.de comment="label:poo157996" openqa-advanced-retrigger-jobs
On s390zl13 I see
May 08 22:43:40 s390zl13 virtqemud[22259]: unsupported configuration: machine type 's390-ccw-virtio-8.2' does not support ACPI
This is probably due to https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshVirtsh.pm#L114 where os-autoinst adds an ACPI element to the VM config, while s390x VMs likely no longer support ACPI, which now causes that problem. I called snapper rollback …
with a snapshot from before the Leap 15.6 upgrade on both s390zl12 and s390zl13 now and will retrigger the corresponding test failures.
Reported #160095 for the s390 problem and #160098 for a problem with drbd+libvirt on osiris.
Now
sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui' cmd.run 'grep VERSION_ID /etc/os-release ; uptime' queue=True
shows that all relevant hosts except s390zl12 and s390zl13 are upgraded:
s390zl12.oqa.prg2.suse.org:
VERSION_ID="15.5"
23:20:38 up 0:25, 0 users, load average: 0.00, 0.00, 0.00
ada.qe.prg2.suse.org:
VERSION_ID="15.6"
23:20:38 up 0:36, 0 users, load average: 0.16, 0.24, 0.37
osiris-1.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:28, 0 users, load average: 0.00, 0.17, 0.22
storage.qe.prg2.suse.org:
VERSION_ID="15.6"
23:20:38 up 0:35, 0 users, load average: 0.00, 0.01, 0.06
s390zl13.oqa.prg2.suse.org:
VERSION_ID="15.5"
23:20:38 up 0:25, 40 users, load average: 37.61, 34.77, 26.62
schort-server.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:25, 0 users, load average: 0.03, 0.02, 0.05
jenkins.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 0:38, 0 users, load average: 0.00, 0.03, 0.11
backup-vm.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:26, 0 users, load average: 0.33, 0.08, 0.03
baremetal-support.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:25, 0 users, load average: 0.00, 0.00, 0.00
tumblesle.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:25, 0 users, load average: 0.08, 0.04, 0.18
monitor.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:25, 0 users, load average: 0.25, 0.31, 0.37
qamaster.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 1:27, 0 users, load average: 1.19, 1.84, 2.49
openqaw5-xen.qe.prg2.suse.org:
VERSION_ID="15.6"
23:20:38 up 0:39, 16 users, load average: 0.90, 1.18, 0.97
unreal6.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:38 up 0:47, 16 users, load average: 0.33, 0.55, 0.85
openqa-piworker.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:41 up 3:02, 0 users, load average: 1.04, 0.35, 0.22
backup-qam.qe.nue2.suse.org:
VERSION_ID="15.6"
23:20:54 up 2:03, 0 users, load average: 0.16, 0.09, 0.07
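With this many hosts it is easy to overlook one that is still on the old release. The listing can be reduced to the outliers with a small filter; a minimal sketch, where the function name `hosts_not_on` is hypothetical and the input is assumed to be the salt output above (a host line ending in ":" followed by its VERSION_ID line).

```shell
# Hypothetical helper: print hosts whose VERSION_ID differs from the wanted one.
# Expects the output of the salt cmd.run call above on stdin.
hosts_not_on() {
    awk -v want="$1" '
        /VERSION_ID=/ {
            gsub(/.*VERSION_ID=|"/, "")        # keep only the bare version number
            if ($0 != want) print host
            next
        }
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1) }  # host line, strip trailing colon
    '
}
```

Piping the command above into `hosts_not_on 15.6` would be expected to print only the two s390 hosts.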
Putting this into feedback for the next few days and monitoring the impact.