tickets #162326
openLeap 15.6 upgrade diary
90%
Description
Update all the Leap based machines from 15.5 to 15.6, track the progress and anything noteworthy using comments here.
Updated by crameleon 5 months ago
- Due date set to 2024-06-12
- Start date changed from 2024-06-15 to 2024-06-12
- Follows tickets #162092: Prepare openSUSE:infrastructure* for 15.6 added
Updated by crameleon 5 months ago
Change for new HAProxy version:
Changes for new Apache httpd version:
- https://github.com/openSUSE/salt-formulas/commit/3929b379b621e21ac4ef721a1b54813c5fa61b7b and https://build.opensuse.org/package/rdiff/openSUSE:infrastructure/container-heroes-salt-testing-systemd?linkrev=base&rev=14
- https://build.opensuse.org/package/rdiff/openSUSE:infrastructure/container-heroes-salt-testing-prometheus?linkrev=base&rev=11 (workaround for https://bugzilla.opensuse.org/show_bug.cgi?id=1226379)
Updated by crameleon 5 months ago
I tried speeding up the Kanidm problem by linking it from Factory to openSUSE:infrastructure until it exists in backports, but the package is broken and does not build with debuginfo (which is enabled in o:i): https://bugzilla.opensuse.org/show_bug.cgi?id=1222595.
Updated by crameleon 5 months ago
- Status changed from Blocked to In Progress
Kanidm is still stuck in https://build.opensuse.org/request/show/1180364, but I worked around the problem in o:i for now by setting <debuginfo><disable/></debuginfo> in the linked kanidm package.
This allows us to run zypper --releasever=15.6 dup --allow-vendor-change (the vendor change is necessary to switch from the distribution Kanidm to the o:i one; once 1180364 is through, we vendor change all installations back).
Updated by crameleon 5 months ago
Done:
- download.i.o.o
- thor1.i.o.o
- devcon.i.o.o
- warp.i.o.o
Done with problems:
- witch1.i.o.o:
=> after the upgrade, the Salt master on this machine no longer works properly, all state operations return:
[ERROR ] The 'production' saltenv has no top file, and the fallback saltenv specified by default_top (production) also has no top file
local:
----------
ID: states
Function: no.None
Result: False
Comment: No Top file or master_tops data matches found. Please see master log for details.
Changes:
Summary for local
------------
Succeeded: 0
Failed: 1
------------
Total states run: 1
Total run time: 0.000 ms
- squanchy.i.o.o:
=> After the upgrade, I am locked out of the machine. Through Salt I ran some commands:
root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'rcsshd status'
jid: 20240617181247828542
squanchy.infra.opensuse.org:
* sshd.service - OpenSSH Daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
Active: active (running) since Mon 2024-06-17 18:05:12 UTC; 7min ago
Process: 16601 ExecStartPre=/usr/sbin/sshd-gen-keys-start (code=exited, status=0/SUCCESS)
Process: 16645 ExecStartPre=/usr/sbin/sshd -t $SSHD_OPTS (code=exited, status=0/SUCCESS)
Main PID: 16712 (sshd)
Tasks: 1
CPU: 784ms
CGroup: /system.slice/sshd.service
`-16712 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
Jun 17 18:09:24 squanchy sshd[25635]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:09:30 squanchy sshd[25641]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:09:52 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
Jun 17 18:09:53 squanchy sshd[25656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=2a07:de40:b27e:1201::3 user=root
Jun 17 18:09:54 squanchy sshd[25651]: error: PAM: Authentication failure for root from 2a07:de40:b27e:1201::3
Jun 17 18:09:54 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
Jun 17 18:09:56 squanchy sshd[25651]: Connection closed by authenticating user root 2a07:de40:b27e:1201::3 port 40084 [preauth]
Jun 17 18:10:34 squanchy sshd[25711]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:11:07 squanchy sshd[25733]: Connection closed by 2a07:de40:b27e:1100::a port 48210 [preauth]
Jun 17 18:11:16 squanchy sshd[25739]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'systemctl status kanidm-unixd'
jid: 20240617181244168407
squanchy.infra.opensuse.org:
* kanidm-unixd.service - Kanidm Local Client Resolver
Loaded: loaded (/usr/lib/systemd/system/kanidm-unixd.service; enabled; preset: disabled)
Active: active (running) since Mon 2024-06-17 18:12:39 UTC; 4s ago
Main PID: 25810 (kanidm_unixd)
Tasks: 4 (limit: 4915)
CPU: 10.547s
CGroup: /system.slice/kanidm-unixd.service
`-25810 /usr/sbin/kanidm_unixd
Jun 17 18:12:29 squanchy systemd[1]: Starting Kanidm Local Client Resolver...
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN 🚧 [warn]: WARNING: DB folder /var/cache/kanidm-unixd has 'everyone' permission bits in the mode. This could be a security risk ...
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: ERROR:tcti:src/tss2-tcti/tctildr.c:428:Tss2_TctiLdr_Initialize_Ex() Failed to instantiate TCTI
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: | tpm_err: TssError(Tcti(TctiReturnCode { base_error: NotSupported }))
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN 🚧 [warn]: Unable to open requested tpm device, falling back to soft tpm | tpm_err: TpmContextCreate
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 INFO i [info]: Server started ...
Jun 17 18:12:39 squanchy systemd[1]: Started Kanidm Local Client Resolver.
I sent it a restart of kanidm-unixd, which did not help.
Updated by crameleon 5 months ago · Edited
- falkor21.i.o.o dead after upgrade, freezes at POST when booting from the default boot entry, following up in separate ticket: https://progress.opensuse.org/issues/162401.
Updated by crameleon 5 months ago
- Related to tickets #162401: falkor21.i.o.o freezes at POST added
Updated by firstyear 5 months ago
The problem is that while Kanidm was accepted here https://build.opensuse.org/request/show/1180285 it's not actually available yet. Because of this zypper considers it as needing removal:
Warning: Enforced setting: $releasever=15.6
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...
The following 380 packages are going to be upgraded:
...
The following 5 packages are going to be REMOVED:
kanidm-clients kanidm-unixd-clients libabsl2308_0_0 nfsidmap systemd-sysvinit
At this point I have no idea where the pipeline goes or how it works, so to me it's lost in the void. We'll need someone else to help find where it's stuck and why.
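To spot this kind of removal before actually upgrading a machine, the "REMOVED" section can be pulled out of zypper's output. The following is only a rough sketch: the function name removed_packages is made up for illustration, and it assumes the output layout shown above (a "going to be REMOVED:" header followed directly by the package names, ending at a blank line or the next "The following" header).

```shell
# Print one package name per line from the "going to be REMOVED" section
# of zypper output read on stdin. Assumes the layout pasted in this ticket.
removed_packages() {
    awk '
        # Header line of the removal section: start collecting.
        /packages? (is|are) going to be REMOVED:/ { in_list = 1; next }
        # A blank line or the next section header ends the list.
        in_list && (/^$/ || /^The following/) { in_list = 0 }
        # Every whitespace-separated word inside the list is a package name.
        in_list { for (i = 1; i <= NF; i++) print $i }
    '
}

# Possible usage (a dry run, nothing gets installed):
#   zypper --non-interactive --releasever=15.6 dup --dry-run | removed_packages
```

If kanidm-clients or kanidm-unixd-clients show up in that list, the machine would lose Kanidm logins after the upgrade.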
Updated by crameleon 5 months ago · Edited
@firstyear Your submission was accepted, yes, but not the release of the update: https://build.opensuse.org/request/show/1180364 (see my comment https://progress.opensuse.org/issues/162326?issue_count=403&issue_position=22&next_issue_id=162317&prev_issue_id=162329#note-6 which also includes my workaround).
Updated by crameleon 5 months ago
I made https://bugzilla.opensuse.org/show_bug.cgi?id=1226639 for the PAM issue now.
Updated by crameleon 5 months ago
- provo-gate.i.o.o: numad failed after the upgrade:
Jun 20 16:04:16 provo-gate systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 20 16:04:16 provo-gate numad[629]: Are CPUSETs enabled on this system?
Jun 20 16:04:16 provo-gate numad[629]: They are required for /usr/sbin/numad to function.
Jun 20 16:04:16 provo-gate numad[629]: Check manpage CPUSET(7). You might need to do something like:
Jun 20 16:04:16 provo-gate numad[629]: # mkdir <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]: # mount cgroup -t cgroup -o cpuset <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]: where <DIRECTORY_MOUNT_POINT> is something like:
Jun 20 16:04:16 provo-gate numad[629]: - /sys/fs/cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: - /cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: and then try again...
Jun 20 16:04:16 provo-gate numad[629]: Or, use '-D <DIRECTORY_MOUNT_POINT>' to specify the correct mount point
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Failed with result 'exit-code'.
It ran fine before; I filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226649.
Updated by crameleon 5 months ago
I tracked the Salt root:root permission problem down to rsync, and filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226656 because I cannot figure it out, despite trying different variations of --owner, --group, --super and --chown; the manual and changelog do not indicate anything obvious. Using rsync over ssh from Tumbleweed, the options still work fine. It's either specific to 15.6 or to the rsync:// protocol, but I did not test further.
Updated by crameleon 5 months ago
The PAM issue is due to pam-config being called with --force during %post if /etc/pam.d/common-auth-pc is missing.
Needs to be corrected on these machines before the upgrade:
root@witch1 ~# salt --out-file=/dev/shm/auth-pc --out=text \*.infra.opensuse.org file.file_exists /etc/pam.d/common-auth-pc
root@witch1 ~# grep False /dev/shm/auth-pc
osc-collab.infra.opensuse.org: False
falkor22.infra.opensuse.org: False
ipx-narwal1.infra.opensuse.org: False
ipx-proxy1.infra.opensuse.org: False
nala.infra.opensuse.org: False
mirrorcache-us.infra.opensuse.org: False
narwal4.infra.opensuse.org: False
nala2.infra.opensuse.org: False
status2.infra.opensuse.org: False
mx4.infra.opensuse.org: False
login3.infra.opensuse.org: False
mirrorcache-us-db.infra.opensuse.org: False
provo-mirror.infra.opensuse.org: False
mx3.infra.opensuse.org: False
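The grep above works, but for feeding the affected machines into a follow-up Salt call, it may be handier to have the bare minion IDs. A minimal sketch, assuming the --out=text format seen above (one "minion-id: True" or "minion-id: False" pair per line); the function name minions_reporting_false is hypothetical:

```shell
# Print only the minion IDs whose result is "False" in Salt --out=text output.
# Reads the files given as arguments, or stdin if there are none.
minions_reporting_false() {
    awk -F': ' '$2 == "False" { print $1 }' "$@"
}

# Possible usage with the file written above:
#   minions_reporting_false /dev/shm/auth-pc
```

Matching on the second field instead of grepping the whole line avoids false hits on a minion whose name happens to contain "False".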
Updated by crameleon 5 months ago · Edited
- % Done changed from 10 to 20
Done:
- prg-ns1
- prg-ns2
- mx1
- mx2
- mx-test
mx* needed removal of clamav from openSUSE:infrastructure (version in the distribution is now new enough), and a patch for mtail (for some reason, an additional system call is needed - since the mtail version did not change, maybe something in the default systemd syscall sets changed?): https://build.opensuse.org/request/show/1182631.
Same numad failure as earlier, but oddly only on mx2 - on mx1, numad started fine with the same version.
Updated by crameleon 5 months ago
- % Done changed from 20 to 30
Done:
- narwal{4,5,6,7,8}
- ipx-narwal1
- water{,3,4}
- tsp
- paste
- mx{3,4}
- svn
- rpmlint
- qsc-ns3
- progressoo
- calendar
- netbox1
- slimhat
- pinot
- opi-proxy
- stonehat
- status3
stonehat was a bit interesting: apparently the management address relies on a libvirt network which starts automatically, but only has its virtual interface created when at least one VM using the network is started. libvirt-guests seems not to have resumed the previously running VMs, which required console intervention (interesting too, since no passphrase was recorded in the store; I corrected this now). In any case, the machine is rather poorly configured, so this is probably not an upgrade issue (https://progress.opensuse.org/issues/151453).
Updated by crameleon 4 months ago
Made bug for stonehat libvirt issue: https://bugzilla.opensuse.org/show_bug.cgi?id=1228073.
There is another problem on stonehat: every few days it stops routing any network packets (it still has correct addresses and routes configured, but all network activity is broken and pinging anywhere fails), requiring a reboot to work again (just restarting network.service does not help).
Updated by crameleon 4 months ago
- falkor2{0,2} upgraded as well, because the version mismatch with the already upgraded falkor21 started causing issues. Added the 25_bli patch, and we need to be careful upon reboots to remount /kvm. The ARP problem could probably be mitigated by switching the NFS connection to IPv6 (which I have wanted to do for some time anyway, since it's the only remaining legacy IP connectivity on the clusters).
Updated by crameleon 3 months ago
Done:
- pagure01
Services failed with import errors after the upgrade, due to Leap 15.6 getting a new pygit2 version which is affected by https://github.com/libgit2/pygit2/commit/a8b2421bea55029296cc79ac7c1518b9885d8a6f. Hotpatched for now, and submitted the already existing upstream patch (pagure.git@8a1a7ba9f789ba446bab63783f7b963246861cb8) to our package: https://build.opensuse.org/request/show/1194456.
Updated by crameleon 3 months ago
Done:
- mirrorcache-us-db.i.o.o
Aborted mirrorcache-us.i.o.o due to
Detected 8 file conflicts:
File /usr/lib/perl5/vendor_perl/5.26.1/Time/CTime.pm
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/lib/perl5/vendor_perl/5.26.1/Time/ParseDate.pm
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/lib/perl5/vendor_perl/5.26.1/Time/Timezone.pm
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::CTime.3pm.gz
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::DaysInMonth.3pm.gz
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::JulianDay.3pm.gz
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::ParseDate.3pm.gz
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::Timezone.3pm.gz
from install of
perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
conflicts with file from install of
perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File conflicts happen when two packages attempt to install files with the same name but different contents. If you continue, conflicting files will be replaced losing the previous content.
Continue? [yes/no] (no): no
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
History:
- ABORT request:
Please see the above error message for a hint.
@andriinikitin ^ can the problematic packages from repo-oss be removed?
Updated by crameleon 3 months ago
- Related to tickets #165425: obs-reviewlab.o.o down after upgrade added
Updated by andriinikitin 3 months ago
crameleon wrote in #note-34:
@andriinikitin ^ can the problematic packages from repo-oss be removed?
Hej, sorry for the delay, not sure what is wrong with my notifications.
Yes, I have checked and pontifex has only perl-Time-modules, so perl-Time-ParseDate can be removed.
Let me know if I should do it.
Updated by crameleon 3 months ago · Edited
- % Done changed from 80 to 90
Done:
- status1
MariaDB would time out upon starting as the schema upgrade took a long time. The default unit has TimeoutSec=300. I first tried with TimeoutStartSec=600, but it was not enough. Eventually, TimeoutStartSec=7200 gave it enough time.
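For other database hosts, the same override could be applied as a systemd drop-in instead of editing the packaged unit. This is a sketch only: the unit name mariadb.service, the drop-in file name timeout.conf and the helper name write_start_timeout_dropin are assumptions for illustration, not what was done on status1.

```shell
# Write a systemd drop-in which raises only the start timeout of a unit.
# Arguments: unit name, timeout in seconds, optional base directory
# (defaults to /etc/systemd/system; overridable to try it out safely).
write_start_timeout_dropin() {
    unit="$1"
    seconds="$2"
    base="${3:-/etc/systemd/system}"
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    # TimeoutStartSec only affects startup, the stop timeout stays untouched.
    printf '[Service]\nTimeoutStartSec=%s\n' "$seconds" > "$dir/timeout.conf"
}

# Possible usage (assumed unit name), followed by a reload:
#   write_start_timeout_dropin mariadb.service 7200
#   systemctl daemon-reload
```

Since the drop-in lives outside the package, it survives package updates and can simply be deleted again once the schema upgrade went through.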