action #157975
openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
Upgrade osd workers to openSUSE Leap 15.6 size:S
0%
Description
Motivation
We need to upgrade the workers before the EOL of Leap 15.5 and have a consistent environment
Acceptance criteria
- AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.6 (no failed systemd services, no left over .rpm-new files, etc.)
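A minimal sketch of how the "clean" part of AC1 could be checked on a single worker. The helper name `leftover_rpm_files` is hypothetical, and searching for `*.rpmnew`, `*.rpmsave` and `*.rpmorig` is an assumption based on the suffixes rpm normally uses for such leftovers; it is not taken from the wiki.

```shell
#!/bin/sh
# List leftover rpm config files below a directory (default: /etc).
# Assumption: leftovers carry the usual rpm suffixes .rpmnew, .rpmsave or .rpmorig.
leftover_rpm_files() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null | sort
}

# Usage on a worker (both outputs should be empty after a clean upgrade):
#   leftover_rpm_files /etc
#   systemctl --failed --no-legend
```

Anything printed by either command would need a manual look before the worker counts as done.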
Acceptance tests
- AT1-1:
sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping
is empty
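To make "is empty" easy to check in a script, the salt output can be reduced to bare minion IDs. This is only a sketch: the helper name `pending_workers` is hypothetical, and it assumes salt's default output format where each minion ID is an unindented line ending in ":".

```shell
#!/bin/sh
# Read `salt ... test.ping` output on stdin and print the minion IDs found in it.
# Assumption: default salt output, i.e. each minion ID is an unindented line ending in ":".
pending_workers() {
    sed -n 's/^\([^[:space:]].*\):$/\1/p'
}

# Usage on the salt master:
#   sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping | pending_workers
```

AT1-1 would pass when this prints nothing.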
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the workers are only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Apply the workaround for #162296, i.e.
zypper al -m "boo#1227616" *firewall*
- Start with non-ppc64le due to #169939
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with
snapper rollback
- Consider also ppc64le but see #169939
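The suggestions above could translate into a per-worker command sequence roughly like the following. This is a sketch, not the procedure from the wiki: the helper `upgrade_plan` is hypothetical, it only prints the commands instead of running them, and it assumes the repositories already use `$releasever` so that zypper's global `--releasever` option selects the target version.

```shell
#!/bin/sh
# Print (do not execute) one possible per-worker command sequence.
# Assumptions: target version is passed as $1, repos already use $releasever.
upgrade_plan() {
    target="${1:-15.6}"
    cat <<EOF
zypper al -m "boo#1227616" '*firewall*'
zypper --releasever $target ref
zypper --releasever $target -n dup --dry-run
zypper --releasever $target -n dup
systemctl reboot
EOF
}

# Usage: review the plan first, then run the steps by hand on the worker.
#   upgrade_plan 15.6
```

Printing instead of executing keeps the dry-run result and the lock from the #162296 workaround visible before anything is changed.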
Rollback steps
hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover from and the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)
Updated by okurz 9 months ago
- Copied from action #130588: Upgrade osd workers to openSUSE Leap 15.5 added
Updated by okurz 6 months ago
- Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Updated by okurz 6 months ago
- Status changed from New to In Progress
- Assignee set to okurz
- Target version changed from Tools - Next to Ready
okurz wrote in #note-4:
Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239
I caused this problem with #132137-6 because then the installer uses openSUSE-current also in the deployed repositories, but that's then incompatible with update repos which are either hardcoded to the version like 15.5 or use $releasever.
So I called
sudo salt \* cmd.run 'sed -i -e "s@openSUSE-current@leap/\$releasever@" /etc/zypp/repos.d/*'
to correct. Now we should push through with the upgrade to Leap 15.6 to be consistent
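To see what the sed call above does without touching /etc/zypp/repos.d, the same substitution can be run as a filter on sample input. The function name `fix_repo_urls` is hypothetical; the expression is the one from the command above.

```shell
#!/bin/sh
# Apply the repo URL substitution to text read on stdin.
# Single quotes keep $releasever literal so zypper expands it later, not the shell.
fix_repo_urls() {
    sed -e 's@openSUSE-current@leap/$releasever@'
}

# Usage:
#   fix_repo_urls < /etc/zypp/repos.d/some.repo   # "some.repo" is a placeholder name
```

The salt variant needs `\$releasever` because the expression sits inside double quotes on the minion's shell, whereas here single quotes are enough.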
Updated by okurz 6 months ago
- Copied to action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work size:M added
Updated by okurz 6 months ago · Edited
- Description updated (diff)
I started with w31, where I triggered a reboot. After the reboot it took 20m(!) for the machine to be fully reachable over the network, causing issues because os-autoinst-openvswitch would time out, leading to incomplete jobs like https://openqa.suse.de/tests/14611797 . I triggered another reboot and then ran
WORKER="worker31" host=openqa.suse.de failed_since=2024-06-14 comment="label:poo157975" ./openqa-advanced-retrigger-jobs
I disabled the openQA worker instances on w31 for now:
sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)
and aborted reboot requests and masked rebootmgr for now:
ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'rebootmgrctl cancel && systemctl mask --now rebootmgr'"
and added the corresponding rollback steps.
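For reference, the grep | cut | xargs part of the disable command above turns the `systemctl list-units` table into a plain list of unit names. A small sketch on sample input; the function name `worker_units` and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Reduce `systemctl list-units` output (stdin) to the openQA worker unit names,
# mirroring the grep | cut | xargs pipeline used above.
worker_units() {
    grep openqa-worker-auto-restart | cut -d . -f 1 | xargs
}

# Usage:
#   systemctl list-units | worker_units
```

The names come out without the `.service` suffix, which systemctl accepts because it assumes a service unit when no suffix is given.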
We found that the system is not able to bring up the network completely and also crashes with a kernel panic, so we are reverting to an older kernel version. systemctl status wicked* showed that wicked-nanny is running into dbus related timeouts.
Booting into older kernel:
# grep 'submenu\|menuentry\>' /boot/grub2/grub.cfg
menuentry "Help on bootable snapshot #$snapshot_num" {
menuentry 'openSUSE Leap 15.6' --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
submenu 'Advanced options for openSUSE Leap 15.6' --hotkey=1 $menuentry_id_option 'gnulinux-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default' --hotkey=2 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default (recovery mode)' --hotkey=3 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default' --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default (recovery mode)' --hotkey=1 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
worker31:~ # grub2-reboot '1>3'
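The argument to grub2-reboot is a zero-based path of the form submenu>entry. The following sketch numbers entries so the path can be read off instead of counted by hand. It rests on assumptions: the helper `number_grub_entries` is hypothetical, top-level entries must start in column 0, and entries inside a submenu must be indented, as in an unwrapped grub.cfg. Entries that are indented for other reasons (e.g. inside an `if` block) would be skipped or miscounted, so the result needs a sanity check against the real menu.

```shell
#!/bin/sh
# Number GRUB menu entries read on stdin so they can be passed to grub2-reboot.
# Assumptions: top-level entries start in column 0, submenu children are indented,
# and GRUB counts from 0 on each level.
number_grub_entries() {
    awk '
        /^submenu /   { in_sub = 1; nested = 0; printf "%d %s\n", top++, $0; next }
        /^menuentry / { in_sub = 0; printf "%d %s\n", top++, $0; next }
        /^[ \t]+menuentry / && in_sub {
            sub(/^[ \t]+/, "")
            printf "%d>%d %s\n", top - 1, nested++, $0
        }
    '
}

# Usage:
#   grep 'submenu\|menuentry\>' /boot/grub2/grub.cfg | number_grub_entries
```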
Updated by okurz 6 months ago
- Copied to action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Updated by okurz 6 months ago
That was not successful, there was no output after the initial kernel loading. I booted into 6.4 again and then called snapper rollback $id on the $id from 2024-06-12, before the upgrade to 15.6 happened. I rebooted and the system came up just fine again but struggles with SMART errors. For this I extracted a worker31 specific ticket into #162293. I crosschecked with w32 and it behaves the same. So I assume it's a generic problem which at least affects all our happyware servers w29-w40 the same, possibly o3 workers w21-w28 as well. Now doing the corresponding snapper rollback on all and patching the repo config as in #157975-8
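Finding the $id for such a rollback can be scripted. The sketch below picks the newest snapshot taken before a cutoff date from a "number,date" CSV list. The helper `last_snapshot_before` is hypothetical, and it is an assumption that the installed snapper can produce such a list with `snapper --csvout list --columns number,date` using a comma as separator.

```shell
#!/bin/sh
# Pick the newest snapshot taken before a cutoff date (first argument, YYYY-MM-DD).
# Input (stdin): CSV lines "number,date" in ascending order, with a header line.
# Assumption: `snapper --csvout list --columns number,date` yields this format.
last_snapshot_before() {
    awk -F, -v cutoff="$1" '
        NR > 1 && $2 != "" && $2 < cutoff { id = $1 }
        END { if (id != "") print id }
    '
}

# Usage (prints an id that could then be passed to `snapper rollback`):
#   snapper --csvout list --columns number,date | last_snapshot_before 2024-06-13
```

ISO-style dates sort correctly as plain strings, which is why a string comparison is sufficient here.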
Updated by okurz 6 months ago
Downgraded w29, w30, w31, w32, w33, w34, w35, w40, warm1, warm2. w36-39 are offline as they are currently not used. And applied
sed -i -e "s@openSUSE-current@leap/\$releasever@" -e "s@15\.5/@\$releasever/@g" /etc/zypp/repos.d/* ; zypper ref && zypper -n dup --dry-run
on all. I also checked systemctl status and alerts. It seems we are back to status "green" again for now. Also, all systems currently under salt control are reachable. I recommend we try the upgrade again first with another machine, e.g. a NUE2 or PRG1 based one. openqaworker14 seems to be a good candidate, running only worker classes for which we have redundancy and only qemu. We can also consider w36-w40, related to #139103
Updated by okurz 6 months ago
- Related to action #162260: auto-update.service fails on various workers due to a package conflict added
Updated by okurz 5 months ago
- Copied to action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 added
Updated by okurz 22 days ago
- Related to action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 added
Updated by dheidler 10 days ago
Is the described issue maybe the same as https://bugzilla.suse.com/show_bug.cgi?id=1227616 ? (Note that the bug was reported against 15SP6 but is actually for 15.6.)
Updated by dheidler 10 days ago
Or phrased otherwise: Should we block on https://progress.opensuse.org/issues/162296 and/or escalate this matter as the bug blocks us from updating?
Updated by okurz 10 days ago
dheidler wrote in #note-25:
Or phrased otherwise: Should we block on https://progress.opensuse.org/issues/162296 and/or escalate this matter as the bug blocks us from updating?
From the bug report it seems clear that nobody else reproduced the issue, so unless we investigate further in detail an improvement is unlikely. That is something to be done in #162296 .
However, here we can still upgrade with the known workarounds.
Updated by ybonatakis 8 days ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by ybonatakis 8 days ago
iob@openqa:~> sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping
openqaworker17.qa.suse.cz:
True
openqaworker18.qa.suse.cz:
True
worker29.oqa.prg2.suse.org:
True
openqaworker16.qa.suse.cz:
True
worker31.oqa.prg2.suse.org:
True
worker30.oqa.prg2.suse.org:
True
worker35.oqa.prg2.suse.org:
True
qesapworker-prg7.qa.suse.cz:
True
qesapworker-prg4.qa.suse.cz:
True
worker33.oqa.prg2.suse.org:
True
worker40.oqa.prg2.suse.org:
True
worker-arm2.oqa.prg2.suse.org:
True
worker34.oqa.prg2.suse.org:
True
openqaworker14.qa.suse.cz:
True
qesapworker-prg5.qa.suse.cz:
True
qesapworker-prg6.qa.suse.cz:
True
worker-arm1.oqa.prg2.suse.org:
True
worker32.oqa.prg2.suse.org:
True
diesel.qe.nue2.suse.org:
True
petrol.qe.nue2.suse.org:
True
mania.qe.nue2.suse.org:
True
sapworker1.qe.nue2.suse.org:
True
grenache-1.oqa.prg2.suse.org:
True
Updated by openqa_review 7 days ago
- Due date set to 2024-12-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 7 days ago
Starting with openqaworker17:
openqaworker17:/home/iob # cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.6"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.6"
PRETTY_NAME="openSUSE Leap 15.6"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.6"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"
The only issue I encountered was a conflict with python3-bind:
error: unpacking of archive failed on file /usr/lib/python3.6/site-packages/isc-2.0-py3.6.egg-info: cpio: File from package already exists as a directory in system
error: python3-bind-9.16.20-150400.3.6.noarch: install failed
error: python3-bind-9.16.50-150500.8.21.1.noarch: erase skipped
I backed up the file and removed it. The upgrade was smooth after that.
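The back-up-and-remove step can be done in one move so nothing is lost. A sketch only: the helper `move_aside` and the backup location in the usage line are hypothetical, and the exact commands used on openqaworker17 are not recorded here.

```shell
#!/bin/sh
# Move a path that blocks a package installation into a backup directory.
# Hypothetical helper: name and backup location are illustrative only.
move_aside() {
    path="$1"
    backup_dir="$2"
    mkdir -p "$backup_dir" && mv "$path" "$backup_dir/"
}

# Usage (backup directory is a placeholder):
#   move_aside /usr/lib/python3.6/site-packages/isc-2.0-py3.6.egg-info /root/backup-poo157975
```

With the path out of the way, rpm can unpack its own copy from the package on the next attempt.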
Updated by ybonatakis 7 days ago
This morning I found auto-update.service failing. After a restart it became inactive. Not sure if that was caused by the locked packages for firewall.
openqaworker17:/home/iob # zypper ll
# | Name | Type | Repository | Comment
--+-----------------------------+---------+------------+--------------------------------------------------------------------
1 | *firewall* | package | (any) | boo#1227616
2 | openSUSE-SLE-15.6-2024-3393 | patch | (any) | AUTO-UPDATE CONFLICTING PATCH: patch would conflict with *firewall*
3 | openSUSE-SLE-15.6-2024-3953 | patch | (any) | AUTO-UPDATE CONFLICTING PATCH: patch would conflict with *firewall*
4 | x3270 | package | (any) |
openqaworker17:/home/iob # systemctl status auto-update.service
○ auto-update.service - Automatically patch system packages.
Loaded: loaded (/etc/systemd/system/auto-update.service; static)
Active: inactive (dead) since Thu 2024-12-12 08:06:19 UTC; 2min 4s ago
Duration: 6.185s
TriggeredBy: ● auto-update.timer
Process: 5011 ExecStart=/usr/local/bin/auto-update (code=exited, status=0/SUCCESS)
Main PID: 5011 (code=exited, status=0/SUCCESS)
CPU: 6.164s
Dec 12 08:06:19 openqaworker17 auto-update[5134]: firewall-applet firewall-config firewalld-bash-completion firewalld-prometheus-config firewalld-rpcbind-helper firewalld-test firewalld-zsh-completion firewall-macros keylime-firewalld patch:openSUSE-SLE-15.6-2024-3393 patch:openSUSE-SLE-15.6-2024-3953 plasma5-firewall plasma5-firewall-lang SuSEfirewall2 su>
Dec 12 08:06:19 openqaworker17 auto-update[5134]: Installed:
Dec 12 08:06:19 openqaworker17 auto-update[5134]: firewalld firewalld-lang python3-firewall x3270 yast2-firewall
Dec 12 08:06:19 openqaworker17 auto-update[5134]: Nothing to do.
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + [[ 0 == 4 ]]
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + [[ 0 == 102 ]]
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + return 0
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + needs-restarting --reboothint
Dec 12 08:06:19 openqaworker17 systemd[1]: auto-update.service: Deactivated successfully.
Dec 12 08:06:19 openqaworker17 systemd[1]: auto-update.service: Consumed 6.164s CPU time.
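The trace lines `[[ 0 == 4 ]]` and `[[ 0 == 102 ]]` in the log above suggest that the script compares an exit code against 4 and 102. In zypper's documented exit codes, 4 is ZYPPER_EXIT_ERR_ZYPP and 102 is ZYPPER_EXIT_INF_REBOOT_NEEDED. The sketch below only labels such codes; the helper `classify_zypper_exit` is hypothetical, and what the real auto-update script does for each code is not shown in this ticket.

```shell
#!/bin/sh
# Map a zypper exit code to a short label.
# 0, 4 and 102 are documented zypper exit codes; how the real auto-update
# script reacts to each of them is an assumption and not modelled here.
classify_zypper_exit() {
    case "$1" in
        0)   echo "ok" ;;
        4)   echo "libzypp-error" ;;
        102) echo "reboot-needed" ;;
        *)   echo "other($1)" ;;
    esac
}

# Usage:
#   zypper -n patch; classify_zypper_exit $?
```

In the run logged above the code was 0, which matches the service ending with status=0/SUCCESS.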
Updated by ybonatakis 6 days ago
- Status changed from In Progress to Workable
Paused due to https://progress.opensuse.org/issues/174319
Updated by ybonatakis 6 days ago
- Related to action #174319: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3520298#L74 fails with "File './x86_64/glibc-2.38-150600.14.17.2.x86_64.rpm' not found on medium 'http://download.opensuse.org/update/leap/15.5/sle/'" size:S added
Updated by ybonatakis 3 days ago
openqaworker18 updated. I didn't see any problems.
Updated by livdywan about 22 hours ago
So openqaworker17 + openqaworker18 are up to date. Further workers are planned to be updated (in parallel) next.
Updated by ybonatakis about 20 hours ago
- Priority changed from High to Normal
openqaworker16 updated
Updated by okurz about 18 hours ago
- Status changed from Workable to In Progress
- Priority changed from Normal to High
please keep the prio "High"