tickets #162326

open

Leap 15.6 upgrade diary

Added by crameleon 6 months ago. Updated 3 months ago.

Status: Blocked
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2024-06-12
Due date:
% Done: 90%
Estimated time:

Description

Upgrade all Leap-based machines from 15.5 to 15.6; track progress and anything noteworthy in the comments here.


Related issues 3 (1 open, 2 closed)

Related to openSUSE admin - tickets #162401: falkor21.i.o.o freezes at POST (Resolved, crameleon, 2024-06-17)

Related to openSUSE admin - tickets #165425: obs-reviewlab.o.o down after upgrade (Resolved, hennevogel, 2024-08-17)

Follows openSUSE admin - tickets #162092: Prepare openSUSE:infrastructure* for 15.6 (Workable, crameleon, 2024-06-11)

Actions #1

Updated by crameleon 6 months ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 6 months ago

  • Due date set to 2024-06-12
  • Start date changed from 2024-06-15 to 2024-06-12
  • Follows tickets #162092: Prepare openSUSE:infrastructure* for 15.6 added
Actions #3

Updated by crameleon 6 months ago

  • Due date deleted (2024-06-12)
  • Status changed from New to Blocked

Blocked by Kanidm missing from the 15.6 distribution repositories.
The Backports request has already been accepted but does not seem to be published yet.

Actions #5

Updated by crameleon 6 months ago

I tried speeding up the Kanidm problem by linking it from Factory to openSUSE:infrastructure until it exists in backports, but the package is broken and does not build with debuginfo (which is enabled in o:i): https://bugzilla.opensuse.org/show_bug.cgi?id=1222595.

Actions #6

Updated by crameleon 6 months ago

  • Status changed from Blocked to In Progress

Kanidm is still stuck in https://build.opensuse.org/request/show/1180364, but I worked around the problem in o:i for now by setting <debuginfo><disable/></debuginfo> in the linked kanidm package.
This allows us to run zypper --releasever=15.6 dup --allow-vendor-change (the vendor change is necessary to switch from the distribution Kanidm to the o:i one; once 1180364 is through, we will vendor change all installations back).
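
The upgrade invocation above can be wrapped in a small helper that prints the command, so a simulated run can precede the real one. This is only a sketch, not the procedure used on these machines: the function name `leap_dup_cmd` and its `dry` argument are hypothetical, while `--dry-run` is zypper's own simulation option for `dup`.

```shell
#!/bin/sh
# Hypothetical helper: print the distribution upgrade command for a release.
# Passing "dry" as the second argument appends zypper's --dry-run option.
leap_dup_cmd() {
    releasever=$1
    mode=${2:-real}
    cmd="zypper --releasever=$releasever dup --allow-vendor-change"
    if [ "$mode" = "dry" ]; then
        cmd="$cmd --dry-run"
    fi
    printf '%s\n' "$cmd"
}

# Show the simulated and the real invocation for Leap 15.6.
leap_dup_cmd 15.6 dry
leap_dup_cmd 15.6
```

Reviewing the dry-run output first makes unexpected removals or vendor changes visible before anything is installed.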

Actions #7

Updated by crameleon 6 months ago

  • Assignee set to crameleon
  • % Done changed from 0 to 10

Done:

  • orbit20.i.o.o + asgard1.i.o.o
  • orbit21.i.o.o + asgard2.i.o.o
Actions #8

Updated by crameleon 6 months ago

Done:

  • download.i.o.o
  • thor1.i.o.o
  • devcon.i.o.o
  • warp.i.o.o

Done with problems:

  • witch1.i.o.o:

=> after the upgrade, the Salt master on this machine no longer works properly, all state operations return:

[ERROR   ] The 'production' saltenv has no top file, and the fallback saltenv specified by default_top (production) also has no top file
local:
----------
          ID: states
    Function: no.None
      Result: False
     Comment: No Top file or master_tops data matches found. Please see master log for details.
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    1
------------
Total states run:     1
Total run time:   0.000 ms
  • squanchy.i.o.o:

=> After the upgrade, I am locked out of the machine. I ran some commands through Salt:

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'rcsshd status'
jid: 20240617181247828542
squanchy.infra.opensuse.org:
    * sshd.service - OpenSSH Daemon
         Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:05:12 UTC; 7min ago
        Process: 16601 ExecStartPre=/usr/sbin/sshd-gen-keys-start (code=exited, status=0/SUCCESS)
        Process: 16645 ExecStartPre=/usr/sbin/sshd -t $SSHD_OPTS (code=exited, status=0/SUCCESS)
       Main PID: 16712 (sshd)
          Tasks: 1
            CPU: 784ms
         CGroup: /system.slice/sshd.service
                 `-16712 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

    Jun 17 18:09:24 squanchy sshd[25635]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:30 squanchy sshd[25641]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:52 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:53 squanchy sshd[25656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=2a07:de40:b27e:1201::3  user=root
    Jun 17 18:09:54 squanchy sshd[25651]: error: PAM: Authentication failure for root from 2a07:de40:b27e:1201::3
    Jun 17 18:09:54 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:56 squanchy sshd[25651]: Connection closed by authenticating user root 2a07:de40:b27e:1201::3 port 40084 [preauth]
    Jun 17 18:10:34 squanchy sshd[25711]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:11:07 squanchy sshd[25733]: Connection closed by 2a07:de40:b27e:1100::a port 48210 [preauth]
    Jun 17 18:11:16 squanchy sshd[25739]: fatal: Access denied for user crameleon by PAM account configuration [preauth]

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'systemctl status kanidm-unixd'
jid: 20240617181244168407
squanchy.infra.opensuse.org:
    * kanidm-unixd.service - Kanidm Local Client Resolver
         Loaded: loaded (/usr/lib/systemd/system/kanidm-unixd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:12:39 UTC; 4s ago
       Main PID: 25810 (kanidm_unixd)
          Tasks: 4 (limit: 4915)
            CPU: 10.547s
         CGroup: /system.slice/kanidm-unixd.service
                 `-25810 /usr/sbin/kanidm_unixd

    Jun 17 18:12:29 squanchy systemd[1]: Starting Kanidm Local Client Resolver...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: WARNING: DB folder /var/cache/kanidm-unixd has 'everyone' permission bits in the mode. This could be a security risk ...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: ERROR:tcti:src/tss2-tcti/tctildr.c:428:Tss2_TctiLdr_Initialize_Ex() Failed to instantiate TCTI
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 ERROR    🚨 [error]:  | tpm_err: TssError(Tcti(TctiReturnCode { base_error: NotSupported }))
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: Unable to open requested tpm device, falling back to soft tpm | tpm_err: TpmContextCreate
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 INFO     i [info]: Server started ...
    Jun 17 18:12:39 squanchy systemd[1]: Started Kanidm Local Client Resolver.

I restarted kanidm-unixd through Salt, which did not help.

Actions #9

Updated by crameleon 6 months ago

  • witch1.i.o.o Salt problem solved with chown -R 477:479 /srv/salt-git. Somehow everything in this directory had ended up recursively owned by root:root.
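
A check like the following could detect this kind of ownership drift before Salt starts failing. It is a sketch under assumptions: the helper name `find_misowned` is hypothetical, and the numeric IDs 477:479 are simply the ones used in the chown above.

```shell
#!/bin/sh
# Hypothetical helper: list entries below a directory that are not owned
# by the expected numeric user and group.
find_misowned() {
    dir=$1
    uid=$2
    gid=$3
    find "$dir" \( ! -uid "$uid" -o ! -gid "$gid" \) -print
}

# Example: report anything under /srv/salt-git not owned by 477:479.
if [ -d /srv/salt-git ]; then
    find_misowned /srv/salt-git 477 479
fi
```

Empty output means every entry has the expected owner and group.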
Actions #10

Updated by crameleon 6 months ago

  • squanchy.i.o.o solved with state.apply from the Salt master, somehow our custom PAM configuration got nuked during the upgrade
Actions #11

Updated by crameleon 6 months ago · Edited

Actions #12

Updated by crameleon 6 months ago

Actions #13

Updated by crameleon 6 months ago

After temporarily repairing the falkor21 issue (which indeed turned out to be breakage caused by the 15.6 upgrade), I found it had the same PAM issue. Again, state.apply rewrites the configuration, but this does not seem right.

Actions #14

Updated by crameleon 6 months ago

Given boo#1226497, I would rather block all upgrades on physical machines.

Actions #15

Updated by firstyear 6 months ago

The problem is that while Kanidm was accepted here https://build.opensuse.org/request/show/1180285 it's not actually available yet. Because of this zypper considers it as needing removal:

Warning: Enforced setting: $releasever=15.6
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...

The following 380 packages are going to be upgraded:
...
The following 5 packages are going to be REMOVED:
  kanidm-clients kanidm-unixd-clients libabsl2308_0_0 nfsidmap systemd-sysvinit

At this point I have no idea where the pipeline goes or how it works, so to me it's lost in the void. We'll need someone else to help find where it's stuck and why.

Actions #16

Updated by crameleon 6 months ago · Edited

@firstyear Your submission was accepted, yes, but not the release of the update: https://build.opensuse.org/request/show/1180364 (see my comment https://progress.opensuse.org/issues/162326#note-6 which also includes my workaround).

Actions #17

Updated by crameleon 6 months ago

Actions #18

Updated by crameleon 6 months ago

  • provo-gate.i.o.o failed numad after upgrade:
Jun 20 16:04:16 provo-gate systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 20 16:04:16 provo-gate numad[629]: Are CPUSETs enabled on this system?
Jun 20 16:04:16 provo-gate numad[629]: They are required for /usr/sbin/numad to function.
Jun 20 16:04:16 provo-gate numad[629]: Check manpage CPUSET(7). You might need to do something like:
Jun 20 16:04:16 provo-gate numad[629]:     # mkdir <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     # mount cgroup -t cgroup -o cpuset <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     where <DIRECTORY_MOUNT_POINT> is something like:
Jun 20 16:04:16 provo-gate numad[629]:       - /sys/fs/cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]:       - /cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: and then try again...
Jun 20 16:04:16 provo-gate numad[629]: Or, use '-D <DIRECTORY_MOUNT_POINT>' to specify the correct mount point
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Failed with result 'exit-code'.

It ran fine before the upgrade; I filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226649.
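
Since numad asks whether CPUSETs are enabled, one quick diagnostic is to check how the cpuset controller is exposed on the host. The sketch below is an aid for comparing machines, not a confirmed root cause: the function name `cpuset_layout` is hypothetical, and it only distinguishes the legacy per-controller directory that numad's message mentions from a unified hierarchy that lists controllers in `cgroup.controllers`.

```shell
#!/bin/sh
# Hypothetical helper: report how cpuset is exposed below a cgroup root.
# Prints "v1" if a legacy cpuset directory exists, "v2" if the unified
# hierarchy lists the cpuset controller, and "none" otherwise.
cpuset_layout() {
    root=${1:-/sys/fs/cgroup}
    if [ -d "$root/cpuset" ]; then
        echo v1
    elif [ -f "$root/cgroup.controllers" ] &&
         grep -qw cpuset "$root/cgroup.controllers"; then
        echo v2
    else
        echo none
    fi
}

cpuset_layout
```

Running it on a host where numad starts and on one where it fails would show whether the two differ in this respect.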

Actions #19

Updated by crameleon 6 months ago

I tracked the Salt root:root permission problem down to rsync and filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226656, because I cannot figure it out despite trying different variations of --owner, --group, --super and --chown, and neither the manual nor the changelog indicates anything obvious. Using rsync over SSH from Tumbleweed, the options still work fine. It is either specific to 15.6 or to the rsync:// protocol, but I did not test further.

Actions #20

Updated by crameleon 6 months ago

The PAM issue is caused by pam-config being run with --force during %post if /etc/pam.d/common-auth-pc is missing.

Needs to be corrected on these machines before the upgrade:

root@witch1 ~# salt --out-file=/dev/shm/auth-pc --out=text \*.infra.opensuse.org file.file_exists /etc/pam.d/common-auth-pc
root@witch1 ~# grep False /dev/shm/auth-pc
osc-collab.infra.opensuse.org: False
falkor22.infra.opensuse.org: False
ipx-narwal1.infra.opensuse.org: False
ipx-proxy1.infra.opensuse.org: False
nala.infra.opensuse.org: False
mirrorcache-us.infra.opensuse.org: False
narwal4.infra.opensuse.org: False
nala2.infra.opensuse.org: False
status2.infra.opensuse.org: False
mx4.infra.opensuse.org: False
login3.infra.opensuse.org: False
mirrorcache-us-db.infra.opensuse.org: False
provo-mirror.infra.opensuse.org: False
mx3.infra.opensuse.org: False
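
To turn the output file into a plain list of minions that still need fixing, the `False` entries can be extracted with awk. This is a small sketch assuming the `--out=text` format shown above (one `minion: value` pair per line); the function name `minions_missing_file` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read Salt text output ("<minion>: <value>" per line)
# on stdin and print the IDs of minions that returned False.
minions_missing_file() {
    awk -F': ' '$2 == "False" { print $1 }'
}

# Example with the output file written by the salt command above.
if [ -r /dev/shm/auth-pc ]; then
    minions_missing_file < /dev/shm/auth-pc
fi
```

The resulting list contains bare minion IDs, which is easier to reuse as a target list than the raw grep output.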
Actions #21

Updated by crameleon 6 months ago · Edited

Done:

  • provo-proxy1
  • provo-ns1
  • atlas1
  • atlas2
  • hel1
  • hel2
Actions #22

Updated by crameleon 6 months ago

monitor done (incl. lots of cleanup of old packages and repair of a Prometheus alert rule parsing error on kernel version changes).

Actions #23

Updated by crameleon 6 months ago · Edited

  • % Done changed from 10 to 20

Done:

  • prg-ns1
  • prg-ns2
  • mx1
  • mx2
  • mx-test

mx* needed removal of clamav from openSUSE:infrastructure (version in the distribution is now new enough), and a patch for mtail (for some reason, an additional system call is needed - since the mtail version did not change, maybe something in the default systemd syscall sets changed?): https://build.opensuse.org/request/show/1182631.

Same numad failure as earlier, but oddly only on mx2 - on mx1, numad started fine with the same version.

Actions #24

Updated by crameleon 6 months ago

  • % Done changed from 20 to 30

Done:

  • narwal{4,5,6,7,8}
  • ipx-narwal1
  • water{,3,4}
  • tsp
  • paste
  • mx{3,4}
  • svn
  • rpmlint
  • qsc-ns3
  • progressoo
  • calendar
  • netbox1
  • slimhat
  • pinot
  • opi-proxy
  • stonehat
  • status3

stonehat was a bit interesting: apparently the management address relies on a libvirt network which starts automatically, but only has its virtual interface created when at least one VM using the network is started. libvirt-guests seems not to have resumed the previously running VMs, which required console intervention (interesting too, since no passphrase was recorded in the store; I have corrected this now). In any case the machine is rather poorly configured, so this is probably not an upgrade issue (https://progress.opensuse.org/issues/151453).

Actions #25

Updated by crameleon 5 months ago

Filed a bug for the stonehat libvirt issue: https://bugzilla.opensuse.org/show_bug.cgi?id=1228073.
There is another problem on stonehat: every few days it stops routing any network packets (it still has correct addresses and routes configured, but all network activity is broken and pinging anywhere fails). It requires a reboot to work again; just restarting network.service does not help.

Actions #26

Updated by crameleon 5 months ago

  • falkor2{0,2} upgraded as well, because the version mismatch with the already upgraded falkor21 started causing issues. Added the 25_bli patch; we need to be careful to remount /kvm upon reboots. The ARP problem could probably be mitigated by switching the NFS connection to IPv6 (which I have wanted to do for some time anyway, since it is the only remaining legacy IP connectivity on the clusters).
Actions #27

Updated by crameleon 4 months ago

  • % Done changed from 30 to 60

Done:

  • community2
  • matomo
  • limesurvey
  • odin
  • minio
  • metrics
  • backup
  • nala
  • login3
  • acme
  • mybackup
Actions #28

Updated by crameleon 4 months ago

Done:

  • nala2
  • kubic
  • lnt
  • ipx-proxy1
  • elections2
Actions #29

Updated by crameleon 4 months ago

Done:

  • mirrordb{1,2}
Actions #30

Updated by crameleon 4 months ago

Done:

  • galera{1,2,3}

.. including a large cleanup of packages.

Actions #31

Updated by crameleon 4 months ago · Edited

  • % Done changed from 60 to 70

Done:

  • obsreview

no HTTP service after upgrade: https://progress.opensuse.org/issues/165425.

Actions #32

Updated by crameleon 4 months ago

Done:

  • pagure01

Services failed with import errors after the upgrade, due to Leap 15.6 getting a new pygit2 version which is affected by https://github.com/libgit2/pygit2/commit/a8b2421bea55029296cc79ac7c1518b9885d8a6f. Hotpatched for now, and submitted the already existing upstream patch (pagure.git@8a1a7ba9f789ba446bab63783f7b963246861cb8) to our package: https://build.opensuse.org/request/show/1194456.

Actions #33

Updated by crameleon 4 months ago

Done:

  • provo-mirror

.. also including a large package cleanup.

Actions #34

Updated by crameleon 4 months ago

Done:

  • mirrorcache-us-db.i.o.o

Aborted mirrorcache-us.i.o.o due to

Detected 8 file conflicts:

File /usr/lib/perl5/vendor_perl/5.26.1/Time/CTime.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/lib/perl5/vendor_perl/5.26.1/Time/ParseDate.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/lib/perl5/vendor_perl/5.26.1/Time/Timezone.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::CTime.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::DaysInMonth.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::JulianDay.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::ParseDate.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::Timezone.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File conflicts happen when two packages attempt to install files with the same name but different contents. If you continue, conflicting files will be replaced losing the previous content.
Continue? [yes/no] (no): no

Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
History:
 - ABORT request:

Please see the above error message for a hint.

@andriinikitin ^ can the problematic packages from repo-oss be removed?
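
The conflict report is long but only involves two packages. A filter along these lines condenses such output to the distinct package and repository pairs. It is a sketch: the function name `conflicting_packages` is hypothetical, and it assumes zypper's layout as pasted above, with each package on an indented line ending in the repository name in parentheses.

```shell
#!/bin/sh
# Hypothetical helper: read zypper's file conflict report on stdin and print
# each distinct "<package> <repository>" pair once.
conflicting_packages() {
    sed -n 's/^ *\([^ ]*\) (\([^)]*\))$/\1 \2/p' | LC_ALL=C sort -u
}

# Example input: two of the stanzas from the report above.
conflicting_packages <<'EOF'
File /usr/lib/perl5/vendor_perl/5.26.1/Time/CTime.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
File /usr/share/man/man3/Time::Timezone.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)
EOF
```

Only lines that end in a parenthesised repository name are kept, so the prompts and explanatory text in the report are ignored.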

Actions #35

Updated by crameleon 4 months ago

Actions #36

Updated by crameleon 4 months ago · Edited

  • % Done changed from 70 to 80

Done:

  • kani-test
Actions #37

Updated by andriinikitin 4 months ago

crameleon wrote in #note-34:

@andriinikitin ^ can the problematic packages from repo-oss be removed?

Hej, sorry for the delay, not sure what is wrong with my notifications.

Yes, I have checked and pontifex has only perl-Time-modules so perl-Time-ParseDate can be removed.
Let me know if I should do it.

Actions #38

Updated by crameleon 4 months ago · Edited

Thanks for checking!

Done:

  • mirrorcache-us
Actions #39

Updated by crameleon 4 months ago · Edited

  • % Done changed from 80 to 90

Done:

  • status1

MariaDB would time out upon starting as the schema upgrade took a long time. The default unit has TimeoutSec=300. I first tried with TimeoutStartSec=600, but it was not enough. Eventually, TimeoutStartSec=7200 gave it enough time.
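
A start timeout like this is usually raised with a systemd drop-in rather than by editing the packaged unit. The sketch below writes such a drop-in; the function name `write_start_timeout`, the unit name `mariadb.service` and the file name `timeout.conf` are assumptions for illustration, and 7200 is the value that worked above.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that overrides TimeoutStartSec.
# $1 = drop-in directory, $2 = timeout in seconds.
write_start_timeout() {
    dropin_dir=$1
    seconds=$2
    mkdir -p "$dropin_dir"
    printf '[Service]\nTimeoutStartSec=%s\n' "$seconds" \
        > "$dropin_dir/timeout.conf"
}

# Example (as root; unit name assumed):
#   write_start_timeout /etc/systemd/system/mariadb.service.d 7200
#   systemctl daemon-reload
```

The drop-in can be deleted again once the schema upgrade has completed, which restores the packaged default.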

Actions #40

Updated by crameleon 3 months ago

  • Status changed from In Progress to Blocked

Pending:
