tickets #162326
openLeap 15.6 upgrade diary
Added by crameleon 15 days ago. Updated 7 days ago.
30%
Description
Update all the Leap-based machines from 15.5 to 15.6. Track the progress and anything noteworthy using comments here.
Updated by crameleon 15 days ago
- Due date set to 2024-06-12
- Start date changed from 2024-06-15 to 2024-06-12
- Follows tickets #162092: Prepare openSUSE:infrastructure* for 15.6 added
Updated by crameleon 14 days ago
Change for new HAProxy version:
Changes for new Apache httpd version:
- https://github.com/openSUSE/salt-formulas/commit/3929b379b621e21ac4ef721a1b54813c5fa61b7b and https://build.opensuse.org/package/rdiff/openSUSE:infrastructure/container-heroes-salt-testing-systemd?linkrev=base&rev=14
- https://build.opensuse.org/package/rdiff/openSUSE:infrastructure/container-heroes-salt-testing-prometheus?linkrev=base&rev=11 (workaround for https://bugzilla.opensuse.org/show_bug.cgi?id=1226379)
Updated by crameleon 14 days ago
I tried to speed up resolving the Kanidm problem by linking the package from Factory to openSUSE:infrastructure until it exists in backports, but the package is broken and does not build with debuginfo (which is enabled in o:i): https://bugzilla.opensuse.org/show_bug.cgi?id=1222595.
Updated by crameleon 13 days ago
- Status changed from Blocked to In Progress
Kanidm is still stuck in https://build.opensuse.org/request/show/1180364, but I worked around the problem in o:i for now by setting <debuginfo><disable/></debuginfo> in the linked kanidm package.
This allows us to run zypper --releasever=15.6 dup --allow-vendor-change (a vendor change is necessary to switch from the distribution Kanidm to the o:i one; once 1180364 is through, we will vendor change all installations back).
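The upgrade command above can be wrapped so that a preview run comes first. This is a minimal sketch: the helper name `leap_dup` and the `ZYPPER` variable are made up for illustration, `--dry-run` is zypper's own option for `dup`, and the other options are the ones quoted above.

```shell
# Hypothetical wrapper around the upgrade command from this ticket.
# ZYPPER can be overridden (e.g. ZYPPER=echo) to only print the arguments.
ZYPPER="${ZYPPER:-zypper}"

# leap_dup RELEASEVER [extra dup options...]
leap_dup() {
    releasever="$1"
    shift
    # Same invocation as in the comment above, plus any extra options.
    "$ZYPPER" --releasever="$releasever" dup --allow-vendor-change "$@"
}
```

`leap_dup 15.6 --dry-run` would show the planned changes (including removals) without applying them; `leap_dup 15.6` then performs the upgrade.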
Updated by crameleon 13 days ago
Done:
- download.i.o.o
- thor1.i.o.o
- devcon.i.o.o
- warp.i.o.o
Done with problems:
- witch1.i.o.o:
=> after the upgrade, the Salt master on this machine no longer works properly, all state operations return:
[ERROR ] The 'production' saltenv has no top file, and the fallback saltenv specified by default_top (production) also has no top file
local:
----------
ID: states
Function: no.None
Result: False
Comment: No Top file or master_tops data matches found. Please see master log for details.
Changes:
Summary for local
------------
Succeeded: 0
Failed: 1
------------
Total states run: 1
Total run time: 0.000 ms
- squanchy.i.o.o:
=> After the upgrade, I am locked out of the machine. I ran some commands through Salt:
root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'rcsshd status'
jid: 20240617181247828542
squanchy.infra.opensuse.org:
* sshd.service - OpenSSH Daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
Active: active (running) since Mon 2024-06-17 18:05:12 UTC; 7min ago
Process: 16601 ExecStartPre=/usr/sbin/sshd-gen-keys-start (code=exited, status=0/SUCCESS)
Process: 16645 ExecStartPre=/usr/sbin/sshd -t $SSHD_OPTS (code=exited, status=0/SUCCESS)
Main PID: 16712 (sshd)
Tasks: 1
CPU: 784ms
CGroup: /system.slice/sshd.service
`-16712 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
Jun 17 18:09:24 squanchy sshd[25635]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:09:30 squanchy sshd[25641]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:09:52 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
Jun 17 18:09:53 squanchy sshd[25656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=2a07:de40:b27e:1201::3 user=root
Jun 17 18:09:54 squanchy sshd[25651]: error: PAM: Authentication failure for root from 2a07:de40:b27e:1201::3
Jun 17 18:09:54 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
Jun 17 18:09:56 squanchy sshd[25651]: Connection closed by authenticating user root 2a07:de40:b27e:1201::3 port 40084 [preauth]
Jun 17 18:10:34 squanchy sshd[25711]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
Jun 17 18:11:07 squanchy sshd[25733]: Connection closed by 2a07:de40:b27e:1100::a port 48210 [preauth]
Jun 17 18:11:16 squanchy sshd[25739]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'systemctl status kanidm-unixd'
jid: 20240617181244168407
squanchy.infra.opensuse.org:
* kanidm-unixd.service - Kanidm Local Client Resolver
Loaded: loaded (/usr/lib/systemd/system/kanidm-unixd.service; enabled; preset: disabled)
Active: active (running) since Mon 2024-06-17 18:12:39 UTC; 4s ago
Main PID: 25810 (kanidm_unixd)
Tasks: 4 (limit: 4915)
CPU: 10.547s
CGroup: /system.slice/kanidm-unixd.service
`-25810 /usr/sbin/kanidm_unixd
Jun 17 18:12:29 squanchy systemd[1]: Starting Kanidm Local Client Resolver...
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN 🚧 [warn]: WARNING: DB folder /var/cache/kanidm-unixd has 'everyone' permission bits in the mode. This could be a security risk ...
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: ERROR:tcti:src/tss2-tcti/tctildr.c:428:Tss2_TctiLdr_Initialize_Ex() Failed to instantiate TCTI
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: | tpm_err: TssError(Tcti(TctiReturnCode { base_error: NotSupported }))
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN 🚧 [warn]: Unable to open requested tpm device, falling back to soft tpm | tpm_err: TpmContextCreate
Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 INFO i [info]: Server started ...
Jun 17 18:12:39 squanchy systemd[1]: Started Kanidm Local Client Resolver.
I sent it a restart of kanidm-unixd, which did not help.
Updated by crameleon 13 days ago · Edited
- falkor21.i.o.o dead after upgrade, freezes at POST when booting from the default boot entry, following up in separate ticket: https://progress.opensuse.org/issues/162401.
Updated by crameleon 13 days ago
- Related to tickets #162401: falkor21.i.o.o freezes at POST added
Updated by firstyear 11 days ago
The problem is that while Kanidm was accepted here https://build.opensuse.org/request/show/1180285, it's not actually available yet. Because of this, zypper considers it as needing removal:
Warning: Enforced setting: $releasever=15.6
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...
The following 380 packages are going to be upgraded:
...
The following 5 packages are going to be REMOVED:
kanidm-clients kanidm-unixd-clients libabsl2308_0_0 nfsidmap systemd-sysvinit
At this point I have no idea where the pipeline goes or how it works, so to me it's lost in the void. We'll need someone else to help find where it's stuck and why.
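A removal list like the one above is easy to miss in a long transcript. Below is a minimal sketch for pulling it out of saved zypper output. It assumes the package names follow the "going to be REMOVED:" header on consecutive non-empty lines, as in the excerpt above. The function name `removed_packages` is made up for this example.

```shell
# removed_packages
# Reads a zypper transcript on stdin and prints, one per line, the
# packages listed under "... going to be REMOVED:".
removed_packages() {
    awk '
        /going to be REMOVED:/    { grab = 1; next }   # start of the list
        grab && /^The following / { grab = 0 }         # next section begins
        grab && !NF               { grab = 0 }         # blank line ends the list
        grab { for (i = 1; i <= NF; i++) print $i }    # one package per line
    '
}
```

With a transcript saved to a file (name assumed), `removed_packages < dup.log` prints the candidates, so an unexpected entry such as kanidm-unixd-clients stands out before the upgrade is confirmed.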
Updated by crameleon 10 days ago · Edited
@firstyear Your submission was accepted, yes, but not the release of the update: https://build.opensuse.org/request/show/1180364 (see my comment https://progress.opensuse.org/issues/162326#note-6, which also includes my workaround).
Updated by crameleon 10 days ago
I made https://bugzilla.opensuse.org/show_bug.cgi?id=1226639 for the PAM issue now.
Updated by crameleon 10 days ago
- provo-gate.i.o.o: numad failed after the upgrade:
Jun 20 16:04:16 provo-gate systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 20 16:04:16 provo-gate numad[629]: Are CPUSETs enabled on this system?
Jun 20 16:04:16 provo-gate numad[629]: They are required for /usr/sbin/numad to function.
Jun 20 16:04:16 provo-gate numad[629]: Check manpage CPUSET(7). You might need to do something like:
Jun 20 16:04:16 provo-gate numad[629]: # mkdir <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]: # mount cgroup -t cgroup -o cpuset <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]: where <DIRECTORY_MOUNT_POINT> is something like:
Jun 20 16:04:16 provo-gate numad[629]: - /sys/fs/cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: - /cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: and then try again...
Jun 20 16:04:16 provo-gate numad[629]: Or, use '-D <DIRECTORY_MOUNT_POINT>' to specify the correct mount point
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Failed with result 'exit-code'.
It ran fine before; I made https://bugzilla.opensuse.org/show_bug.cgi?id=1226649.
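The numad message asks whether cpusets are available. Below is a minimal diagnostic sketch for comparing machines. It checks for a dedicated cpuset hierarchy (the path named in the log) and, as an assumption about the unified cgroup layout, for "cpuset" in the `cgroup.controllers` file. The function name is made up, and the check only shows what the machine offers, not which layout numad expects.

```shell
# cpuset_available [CGROUP_ROOT]
# Succeeds if a cpuset controller is visible under CGROUP_ROOT
# (default /sys/fs/cgroup), in either cgroup layout.
cpuset_available() {
    root="${1:-/sys/fs/cgroup}"
    # Dedicated cpuset hierarchy, as suggested in the numad message.
    [ -d "$root/cpuset" ] && return 0
    # Unified hierarchy: enabled controllers are listed in one file.
    [ -r "$root/cgroup.controllers" ] &&
        grep -qw cpuset "$root/cgroup.controllers"
}
```

Running `cpuset_available && echo yes || echo no` on a machine where numad starts and on one where it fails could help narrow down the difference.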
Updated by crameleon 10 days ago
I tracked the Salt root:root permission problem down to rsync, and made https://bugzilla.opensuse.org/show_bug.cgi?id=1226656 because I cannot figure it out: I tried different variations of --owner, --group, --super and --chown, and neither the manual nor the changelog indicates anything obvious. Using rsync over ssh from Tumbleweed, the options still work fine. It is either specific to 15.6 or to the rsync:// protocol, but I did not test further.
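To see how far the ownership problem reaches after a sync, the affected tree can be scanned with find. This is a minimal sketch: the function name is made up, and the expected owner and the directory are parameters because the ticket does not name them.

```shell
# not_owned_by OWNER DIR
# Prints every path below DIR whose owner is not OWNER
# (OWNER may be a user name or a numeric uid).
not_owned_by() {
    find "$2" ! -user "$1" -print
}
```

For example, `not_owned_by salt /srv/salt`, where both the user and the path are assumptions about a typical Salt master setup rather than values taken from this ticket.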
Updated by crameleon 10 days ago
The PAM issue is due to pam-config being invoked with --force during %post if /etc/pam.d/common-auth-pc is missing.
Needs to be corrected on these machines before the upgrade:
root@witch1 ~# salt --out-file=/dev/shm/auth-pc --out=text \*.infra.opensuse.org file.file_exists /etc/pam.d/common-auth-pc
root@witch1 ~# grep False /dev/shm/auth-pc
osc-collab.infra.opensuse.org: False
falkor22.infra.opensuse.org: False
ipx-narwal1.infra.opensuse.org: False
ipx-proxy1.infra.opensuse.org: False
nala.infra.opensuse.org: False
mirrorcache-us.infra.opensuse.org: False
narwal4.infra.opensuse.org: False
nala2.infra.opensuse.org: False
status2.infra.opensuse.org: False
mx4.infra.opensuse.org: False
login3.infra.opensuse.org: False
mirrorcache-us-db.infra.opensuse.org: False
provo-mirror.infra.opensuse.org: False
mx3.infra.opensuse.org: False
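The grep above leaves the ": False" suffix on every line. Below is a minimal sketch that reduces the same output file to bare minion names for a follow-up fix. It assumes the one-line-per-minion text format shown above, and the function name is made up.

```shell
# minions_missing_file [FILE]
# Reads Salt "--out=text" results ("minion: True|False" per line)
# from FILE or stdin and prints the minions that reported False.
minions_missing_file() {
    awk -F': ' '$2 == "False" { print $1 }' "$@"
}
```

`minions_missing_file /dev/shm/auth-pc` would then print one host name per line.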
Updated by crameleon 8 days ago · Edited
- % Done changed from 10 to 20
Done:
- prg-ns1
- prg-ns2
- mx1
- mx2
- mx-test
mx* needed removal of clamav from openSUSE:infrastructure (the version in the distribution is now new enough) and a patch for mtail: https://build.opensuse.org/request/show/1182631. For some reason, an additional system call is needed; since the mtail version did not change, maybe something in the default systemd syscall sets changed?
Same numad failure as earlier, but oddly only on mx2 - on mx1, numad started fine with the same version.
Updated by crameleon 7 days ago
- % Done changed from 20 to 30
Done:
- narwal{4,5,6,7,8}
- ipx-narwal1
- water{,3,4}
- tsp
- paste
- mx{3,4}
- svn
- rpmlint
- qsc-ns3
- progressoo
- calendar
- netbox1
- slimhat
- pinot
- opi-proxy
- stonehat
- status3
stonehat was a bit interesting: apparently the management address relies on a libvirt network which starts automatically, but only has its virtual interface created once at least one VM using the network is started. libvirt-guests seems not to have resumed the previously running VMs, so console intervention was needed (which was interesting too, since no passphrase was recorded in the store; I corrected this now). In any case the machine is rather poorly configured, so this is probably not an upgrade issue (https://progress.opensuse.org/issues/151453).