tickets #162326

open

Leap 15.6 upgrade diary

Added by crameleon 15 days ago. Updated 7 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2024-06-12
Due date:
% Done: 30%
Estimated time:

Description

Update all the Leap based machines from 15.5 to 15.6, track the progress and anything noteworthy using comments here.


Related issues 2 (2 open, 0 closed)

Related to openSUSE admin - tickets #162401: falkor21.i.o.o freezes at POST (Blocked, crameleon, 2024-06-17)
Follows openSUSE admin - tickets #162092: Prepare openSUSE:infrastructure* for 15.6 (In Progress, 2024-06-11)

Actions #1

Updated by crameleon 15 days ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 15 days ago

  • Due date set to 2024-06-12
  • Start date changed from 2024-06-15 to 2024-06-12
  • Follows tickets #162092: Prepare openSUSE:infrastructure* for 15.6 added
Actions #3

Updated by crameleon 15 days ago

  • Due date deleted (2024-06-12)
  • Status changed from New to Blocked

Blocked by Kanidm missing from 15.6 distribution repositories.
Backports request was accepted already but seems not published yet.

Actions #5

Updated by crameleon 14 days ago

I tried speeding up the Kanidm problem by linking it from Factory to openSUSE:infrastructure until it exists in backports, but the package is broken and does not build with debuginfo (which is enabled in o:i): https://bugzilla.opensuse.org/show_bug.cgi?id=1222595.

Actions #6

Updated by crameleon 13 days ago

  • Status changed from Blocked to In Progress

Kanidm is still stuck in https://build.opensuse.org/request/show/1180364, but I worked around the problem in o:i for now by setting <debuginfo><disable/></debuginfo> in the linked kanidm package.
This allows us to zypper --releasever=15.6 dup --allow-vendor-change (the vendor change is necessary to switch from the distribution Kanidm to the o:i one; once request 1180364 is through, we will vendor change all installations back).

Actions #7

Updated by crameleon 13 days ago

  • Assignee set to crameleon
  • % Done changed from 0 to 10

Done:

  • orbit20.i.o.o + asgard1.i.o.o
  • orbit21.i.o.o + asgard2.i.o.o
Actions #8

Updated by crameleon 13 days ago

Done:

  • download.i.o.o
  • thor1.i.o.o
  • devcon.i.o.o
  • warp.i.o.o

Done with problems:

  • witch1.i.o.o:

=> after the upgrade, the Salt master on this machine no longer works properly, all state operations return:

[ERROR   ] The 'production' saltenv has no top file, and the fallback saltenv specified by default_top (production) also has no top file
local:
----------
          ID: states
    Function: no.None
      Result: False
     Comment: No Top file or master_tops data matches found. Please see master log for details.
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    1
------------
Total states run:     1
Total run time:   0.000 ms
  • squanchy.i.o.o:

=> After the upgrade, I am locked out of the machine. Through Salt, I ran some commands:

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'rcsshd status'
jid: 20240617181247828542
squanchy.infra.opensuse.org:
    * sshd.service - OpenSSH Daemon
         Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:05:12 UTC; 7min ago
        Process: 16601 ExecStartPre=/usr/sbin/sshd-gen-keys-start (code=exited, status=0/SUCCESS)
        Process: 16645 ExecStartPre=/usr/sbin/sshd -t $SSHD_OPTS (code=exited, status=0/SUCCESS)
       Main PID: 16712 (sshd)
          Tasks: 1
            CPU: 784ms
         CGroup: /system.slice/sshd.service
                 `-16712 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

    Jun 17 18:09:24 squanchy sshd[25635]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:30 squanchy sshd[25641]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:52 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:53 squanchy sshd[25656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=2a07:de40:b27e:1201::3  user=root
    Jun 17 18:09:54 squanchy sshd[25651]: error: PAM: Authentication failure for root from 2a07:de40:b27e:1201::3
    Jun 17 18:09:54 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:56 squanchy sshd[25651]: Connection closed by authenticating user root 2a07:de40:b27e:1201::3 port 40084 [preauth]
    Jun 17 18:10:34 squanchy sshd[25711]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:11:07 squanchy sshd[25733]: Connection closed by 2a07:de40:b27e:1100::a port 48210 [preauth]
    Jun 17 18:11:16 squanchy sshd[25739]: fatal: Access denied for user crameleon by PAM account configuration [preauth]

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'systemctl status kanidm-unixd'
jid: 20240617181244168407
squanchy.infra.opensuse.org:
    * kanidm-unixd.service - Kanidm Local Client Resolver
         Loaded: loaded (/usr/lib/systemd/system/kanidm-unixd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:12:39 UTC; 4s ago
       Main PID: 25810 (kanidm_unixd)
          Tasks: 4 (limit: 4915)
            CPU: 10.547s
         CGroup: /system.slice/kanidm-unixd.service
                 `-25810 /usr/sbin/kanidm_unixd

    Jun 17 18:12:29 squanchy systemd[1]: Starting Kanidm Local Client Resolver...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: WARNING: DB folder /var/cache/kanidm-unixd has 'everyone' permission bits in the mode. This could be a security risk ...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: ERROR:tcti:src/tss2-tcti/tctildr.c:428:Tss2_TctiLdr_Initialize_Ex() Failed to instantiate TCTI
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 ERROR    🚨 [error]:  | tpm_err: TssError(Tcti(TctiReturnCode { base_error: NotSupported }))
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: Unable to open requested tpm device, falling back to soft tpm | tpm_err: TpmContextCreate
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 INFO     i [info]: Server started ...
    Jun 17 18:12:39 squanchy systemd[1]: Started Kanidm Local Client Resolver.

I restarted kanidm-unixd through Salt, which did not help.

Actions #9

Updated by crameleon 13 days ago

  • witch1.i.o.o Salt problem solved with chown -R 477:479 /srv/salt-git; somehow everything in this directory had become recursively owned by root:root
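
To spot this kind of ownership drift before the Salt master starts failing, something like the following could be run after an upgrade. It is only a sketch: the function name list_wrong_owner is made up for illustration, and it assumes a find that accepts numeric IDs for -user and -group (GNU find does).

```shell
# List every path under a directory whose owner or group differs from the
# expected numeric IDs. Prints nothing if the whole tree is consistent.
# (Hypothetical helper, not part of any existing tooling.)
list_wrong_owner() {
    dir="$1"
    uid="$2"
    gid="$3"
    find "$dir" \( ! -user "$uid" -o ! -group "$gid" \) -print
}
```

For the case above, the call would be list_wrong_owner /srv/salt-git 477 479; any output means the chown is needed again.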
Actions #10

Updated by crameleon 13 days ago

  • squanchy.i.o.o solved with state.apply from the Salt master, somehow our custom PAM configuration got nuked during the upgrade
Actions #11

Updated by crameleon 13 days ago · Edited

Actions #12

Updated by crameleon 13 days ago

Actions #13

Updated by crameleon 12 days ago

After temporarily repairing the falkor21 issue (which did indeed turn out to be breakage caused by the 15.6 upgrade), I found it to have the same PAM issue. Again, state.apply restores the configuration, but this does not seem right.

Actions #14

Updated by crameleon 12 days ago

Given boo#1226497, I would rather block all upgrades on physical machines.

Actions #15

Updated by firstyear 10 days ago

The problem is that while Kanidm was accepted here https://build.opensuse.org/request/show/1180285 it's not actually available yet. Because of this zypper considers it as needing removal:

Warning: Enforced setting: $releasever=15.6
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...

The following 380 packages are going to be upgraded:
...
The following 5 packages are going to be REMOVED:
  kanidm-clients kanidm-unixd-clients libabsl2308_0_0 nfsidmap systemd-sysvinit

At this point I have no idea where the pipeline goes or how it works, so to me it's lost in the void. We'll need someone else to help find where it's stuck and why.
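
For checking many machines, the removal list can be pulled out of the zypper output mechanically. A rough sketch, assuming the output keeps the layout shown above (a "going to be REMOVED:" header followed by indented package names); the function name removed_packages is made up for illustration.

```shell
# Read `zypper dup` output on stdin and print the packages listed in the
# "going to be REMOVED" section, one per line.
# (Hypothetical helper; relies on the output layout quoted in this ticket.)
removed_packages() {
    awk '
        /going to be REMOVED:/ { in_section = 1; next }
        in_section && /^[ \t]+[^ \t]/ {
            for (i = 1; i <= NF; i++) print $i
            next
        }
        in_section { in_section = 0 }
    '
}
```

One way to use it would be to pipe a dry run into it, for example zypper --non-interactive --releasever=15.6 dup --dry-run | removed_packages | grep kanidm, and hold off on the upgrade if anything is printed.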

Actions #16

Updated by crameleon 10 days ago · Edited

@firstyear Your submission was accepted, yes, but not the release of the update: https://build.opensuse.org/request/show/1180364 (see my comment https://progress.opensuse.org/issues/162326?issue_count=403&issue_position=22&next_issue_id=162317&prev_issue_id=162329#note-6 which also includes my workaround).

Actions #17

Updated by crameleon 10 days ago

Actions #18

Updated by crameleon 10 days ago

  • provo-gate.i.o.o failed numad after upgrade:
Jun 20 16:04:16 provo-gate systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 20 16:04:16 provo-gate numad[629]: Are CPUSETs enabled on this system?
Jun 20 16:04:16 provo-gate numad[629]: They are required for /usr/sbin/numad to function.
Jun 20 16:04:16 provo-gate numad[629]: Check manpage CPUSET(7). You might need to do something like:
Jun 20 16:04:16 provo-gate numad[629]:     # mkdir <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     # mount cgroup -t cgroup -o cpuset <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     where <DIRECTORY_MOUNT_POINT> is something like:
Jun 20 16:04:16 provo-gate numad[629]:       - /sys/fs/cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]:       - /cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: and then try again...
Jun 20 16:04:16 provo-gate numad[629]: Or, use '-D <DIRECTORY_MOUNT_POINT>' to specify the correct mount point
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Failed with result 'exit-code'.

It ran fine before; I filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226649.
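
Going by the log message, numad appears to look for a cgroup mount carrying the cpuset option. Under that assumption (not confirmed against the numad source), a quick check of the mounts table could look like the sketch below; the function name cpuset_v1_mounted is made up for illustration.

```shell
# Succeed if the given mounts table (default: /proc/mounts) contains a
# cgroup (v1) mount with the cpuset option, fail otherwise.
# (Hypothetical helper; assumes the /proc/mounts field layout
#  "device mountpoint fstype options ...".)
cpuset_v1_mounted() {
    mounts="${1:-/proc/mounts}"
    awk '$3 == "cgroup" && $4 ~ /(^|,)cpuset(,|$)/ { found = 1 }
         END { exit !found }' "$mounts"
}
```

Comparing the result on a machine where numad starts with one where it fails might help narrow down the bug report.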

Actions #19

Updated by crameleon 10 days ago

I tracked the Salt root:root permission problem down to rsync and filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226656, because I cannot figure it out despite trying different combinations of --owner, --group, --super and --chown; neither the manual nor the changelog indicates anything obvious. Using rsync over SSH from Tumbleweed, the options still work fine. It is either specific to 15.6 or to the rsync:// protocol, but I did not test further.

Actions #20

Updated by crameleon 10 days ago

The PAM issue is caused by pam-config being run with --force during %post if /etc/pam.d/common-auth-pc is missing.

Needs to be corrected on these machines before the upgrade:

root@witch1 ~# salt --out-file=/dev/shm/auth-pc --out=text \*.infra.opensuse.org file.file_exists /etc/pam.d/common-auth-pc
root@witch1 ~# grep False /dev/shm/auth-pc
osc-collab.infra.opensuse.org: False
falkor22.infra.opensuse.org: False
ipx-narwal1.infra.opensuse.org: False
ipx-proxy1.infra.opensuse.org: False
nala.infra.opensuse.org: False
mirrorcache-us.infra.opensuse.org: False
narwal4.infra.opensuse.org: False
nala2.infra.opensuse.org: False
status2.infra.opensuse.org: False
mx4.infra.opensuse.org: False
login3.infra.opensuse.org: False
mirrorcache-us-db.infra.opensuse.org: False
provo-mirror.infra.opensuse.org: False
mx3.infra.opensuse.org: False
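
The grep above can also be reduced to a bare list of minion IDs, which is easier to feed into a follow-up Salt call. A small sketch, assuming the "minion: value" layout that --out=text produced above; the function name minions_reporting_false is made up for illustration.

```shell
# Read Salt `--out=text` lines ("minion: value") on stdin and print the
# minion IDs whose value is False, one per line.
# (Hypothetical helper; relies on the output layout quoted in this ticket.)
minions_reporting_false() {
    awk -F': ' '$2 == "False" { print $1 }'
}
```

With the file written earlier, this would be minions_reporting_false < /dev/shm/auth-pc.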
Actions #21

Updated by crameleon 9 days ago · Edited

Done:

  • provo-proxy1
  • provo-ns1
  • atlas1
  • atlas2
  • hel1
  • hel2
Actions #22

Updated by crameleon 9 days ago

monitor done (incl. lots of cleanup of old packages and repair of a Prometheus alert rule parsing error on kernel version changes).

Actions #23

Updated by crameleon 8 days ago · Edited

  • % Done changed from 10 to 20

Done:

  • prg-ns1
  • prg-ns2
  • mx1
  • mx2
  • mx-test

mx* needed removal of clamav from openSUSE:infrastructure (version in the distribution is now new enough), and a patch for mtail (for some reason, an additional system call is needed - since the mtail version did not change, maybe something in the default systemd syscall sets changed?): https://build.opensuse.org/request/show/1182631.

Same numad failure as earlier, but oddly only on mx2 - on mx1, numad started fine with the same version.

Actions #24

Updated by crameleon 7 days ago

  • % Done changed from 20 to 30

Done:

  • narwal{4,5,6,7,8}
  • ipx-narwal1
  • water{,3,4}
  • tsp
  • paste
  • mx{3,4}
  • svn
  • rpmlint
  • qsc-ns3
  • progressoo
  • calendar
  • netbox1
  • slimhat
  • pinot
  • opi-proxy
  • stonehat
  • status3

stonehat was a bit interesting: apparently the management address relies on a libvirt network which starts automatically, but whose virtual interface is only created once at least one VM using the network is started. libvirt-guests seems not to have resumed the previously running VMs, which required console intervention (interesting too, since no passphrase was recorded in the store; I have corrected this now). In any case, the machine is rather poorly configured, so this is probably not an upgrade issue (https://progress.opensuse.org/issues/151453).
