tickets #162326: Leap 15.6 upgrade diary - openSUSE admin - openSUSE Project Management Tool

I tried to speed up resolving the Kanidm problem by linking the package from Factory into openSUSE:infrastructure until it exists in backports, but the package is broken and does not build with debuginfo (which is enabled in o:i): https://bugzilla.opensuse.org/show_bug.cgi?id=1222595.

#6

Updated by crameleon 12 months ago

Status changed from Blocked to In Progress

Kanidm is still stuck in https://build.opensuse.org/request/show/1180364, but I worked around the problem in o:i for now by setting <debuginfo><disable/></debuginfo> in the linked kanidm package.
This allows us to run zypper --releasever=15.6 dup --allow-vendor-change (the vendor change is necessary to switch from the distribution Kanidm to the o:i one; once 1180364 is through, we will vendor change all installations back).

#7

Updated by crameleon 12 months ago

Assignee set to crameleon
% Done changed from 0 to 10

Done:

orbit20.i.o.o + asgard1.i.o.o
orbit21.i.o.o + asgard2.i.o.o

#8

Updated by crameleon 12 months ago

Done:

download.i.o.o
thor1.i.o.o
devcon.i.o.o
warp.i.o.o

Done with problems:

witch1.i.o.o:

=> after the upgrade, the Salt master on this machine no longer works properly, all state operations return:

[ERROR   ] The 'production' saltenv has no top file, and the fallback saltenv specified by default_top (production) also has no top file
local:
----------
          ID: states
    Function: no.None
      Result: False
     Comment: No Top file or master_tops data matches found. Please see master log for details.
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    1
------------
Total states run:     1
Total run time:   0.000 ms

squanchy.i.o.o:

=> After the upgrade, I am locked out of the machine. Through Salt I ran some commands:

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'rcsshd status'
jid: 20240617181247828542
squanchy.infra.opensuse.org:
    * sshd.service - OpenSSH Daemon
         Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:05:12 UTC; 7min ago
        Process: 16601 ExecStartPre=/usr/sbin/sshd-gen-keys-start (code=exited, status=0/SUCCESS)
        Process: 16645 ExecStartPre=/usr/sbin/sshd -t $SSHD_OPTS (code=exited, status=0/SUCCESS)
       Main PID: 16712 (sshd)
          Tasks: 1
            CPU: 784ms
         CGroup: /system.slice/sshd.service
                 `-16712 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

    Jun 17 18:09:24 squanchy sshd[25635]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:30 squanchy sshd[25641]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:09:52 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:53 squanchy sshd[25656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=2a07:de40:b27e:1201::3  user=root
    Jun 17 18:09:54 squanchy sshd[25651]: error: PAM: Authentication failure for root from 2a07:de40:b27e:1201::3
    Jun 17 18:09:54 squanchy sshd[25651]: Postponed keyboard-interactive for root from 2a07:de40:b27e:1201::3 port 40084 ssh2 [preauth]
    Jun 17 18:09:56 squanchy sshd[25651]: Connection closed by authenticating user root 2a07:de40:b27e:1201::3 port 40084 [preauth]
    Jun 17 18:10:34 squanchy sshd[25711]: fatal: Access denied for user crameleon by PAM account configuration [preauth]
    Jun 17 18:11:07 squanchy sshd[25733]: Connection closed by 2a07:de40:b27e:1100::a port 48210 [preauth]
    Jun 17 18:11:16 squanchy sshd[25739]: fatal: Access denied for user crameleon by PAM account configuration [preauth]

root@witch1 ~# salt squanchy.infra.opensuse.org cmd.run 'systemctl status kanidm-unixd'
jid: 20240617181244168407
squanchy.infra.opensuse.org:
    * kanidm-unixd.service - Kanidm Local Client Resolver
         Loaded: loaded (/usr/lib/systemd/system/kanidm-unixd.service; enabled; preset: disabled)
         Active: active (running) since Mon 2024-06-17 18:12:39 UTC; 4s ago
       Main PID: 25810 (kanidm_unixd)
          Tasks: 4 (limit: 4915)
            CPU: 10.547s
         CGroup: /system.slice/kanidm-unixd.service
                 `-25810 /usr/sbin/kanidm_unixd

    Jun 17 18:12:29 squanchy systemd[1]: Starting Kanidm Local Client Resolver...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: WARNING: DB folder /var/cache/kanidm-unixd has 'everyone' permission bits in the mode. This could be a security risk ...
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: ERROR:tcti:src/tss2-tcti/tctildr.c:428:Tss2_TctiLdr_Initialize_Ex() Failed to instantiate TCTI
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 ERROR    🚨 [error]:  | tpm_err: TssError(Tcti(TctiReturnCode { base_error: NotSupported }))
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 WARN     🚧 [warn]: Unable to open requested tpm device, falling back to soft tpm | tpm_err: TpmContextCreate
    Jun 17 18:12:39 squanchy kanidm_unixd[25810]: 00000000-0000-0000-0000-000000000000 INFO     ｉ [info]: Server started ...
    Jun 17 18:12:39 squanchy systemd[1]: Started Kanidm Local Client Resolver.

I restarted kanidm-unixd through Salt, which did not help.

#9

Updated by crameleon 12 months ago

The witch1.i.o.o Salt problem is solved with chown -R 477:479 /srv/salt-git; somehow everything in this directory had become recursively owned by root:root.
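
A minimal sketch for spotting this kind of ownership drift before the Salt master trips over it. The function name is hypothetical and not part of any existing tooling; 477:479 are simply the numeric IDs quoted above, passed in as parameters.

```shell
#!/bin/sh
# List every entry under a tree whose owner or group differs from the
# expected numeric IDs. Prints nothing when the whole tree matches.
# $1: directory to scan, $2: expected UID, $3: expected GID
list_wrong_owner() {
    scan_dir=$1
    want_uid=$2
    want_gid=$3
    # -user/-group accept numeric IDs when no account of that name exists.
    find "$scan_dir" \( ! -user "$want_uid" -o ! -group "$want_gid" \) -print
}
```

For the case above this would be called as `list_wrong_owner /srv/salt-git 477 479`; any output means the recursive chown is needed again.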

#10

Updated by crameleon 12 months ago

squanchy.i.o.o is solved with state.apply from the Salt master; somehow our custom PAM configuration got nuked during the upgrade.

#11

Updated by crameleon 12 months ago · Edited

falkor21.i.o.o dead after upgrade, freezes at POST when booting from the default boot entry, following up in separate ticket: https://progress.opensuse.org/issues/162401.

#12

Updated by crameleon 12 months ago

Related to tickets #162401: falkor21.i.o.o freezes at POST added

#13

Updated by crameleon 12 months ago

Upon temporarily repairing the falkor21 issue (which indeed turned out to be breakage caused by the 15.6 upgrade), I found it to have the same PAM issue. Again, state.apply restores the configuration, but this does not seem right.

#14

Updated by crameleon 12 months ago

Given boo#1226497, I would rather block all upgrades on physical machines.

#15

Updated by firstyear 12 months ago

The problem is that while Kanidm was accepted here, https://build.opensuse.org/request/show/1180285, it's not actually available yet. Because of this, zypper considers it as needing removal:

Warning: Enforced setting: $releasever=15.6
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...

The following 380 packages are going to be upgraded:
...
The following 5 packages are going to be REMOVED:
  kanidm-clients kanidm-unixd-clients libabsl2308_0_0 nfsidmap systemd-sysvinit

At this point I have no idea where the pipeline goes or how it works, so to me it's lost in the void. We'll need someone else to help find where it's stuck and why.

#16

Updated by crameleon 12 months ago · Edited

@firstyear Your submission was accepted, yes, but not the release of the update: https://build.opensuse.org/request/show/1180364 (see my comment https://progress.opensuse.org/issues/162326?issue_count=403&issue_position=22&next_issue_id=162317&prev_issue_id=162329#note-6 which also includes my workaround).

#17

Updated by crameleon 12 months ago

I made https://bugzilla.opensuse.org/show_bug.cgi?id=1226639 for the PAM issue now.

#18

Updated by crameleon 12 months ago

provo-gate.i.o.o failed numad after upgrade:

Jun 20 16:04:16 provo-gate systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 20 16:04:16 provo-gate numad[629]: Are CPUSETs enabled on this system?
Jun 20 16:04:16 provo-gate numad[629]: They are required for /usr/sbin/numad to function.
Jun 20 16:04:16 provo-gate numad[629]: Check manpage CPUSET(7). You might need to do something like:
Jun 20 16:04:16 provo-gate numad[629]:     # mkdir <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     # mount cgroup -t cgroup -o cpuset <DIRECTORY_MOUNT_POINT>
Jun 20 16:04:16 provo-gate numad[629]:     where <DIRECTORY_MOUNT_POINT> is something like:
Jun 20 16:04:16 provo-gate numad[629]:       - /sys/fs/cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]:       - /cgroup/cpuset
Jun 20 16:04:16 provo-gate numad[629]: and then try again...
Jun 20 16:04:16 provo-gate numad[629]: Or, use '-D <DIRECTORY_MOUNT_POINT>' to specify the correct mount point
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 16:04:16 provo-gate systemd[1]: numad.service: Failed with result 'exit-code'.

It ran fine before. I filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226649.
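
A small sketch of the check numad's message asks for: whether a cgroup v1 filesystem with the cpuset controller is mounted at all. The function name is hypothetical, and it is only an assumption that a missing mount is what differs after the upgrade; the cause had not been determined at this point.

```shell
#!/bin/sh
# Succeed if the given mount table (default: /proc/mounts) lists a cgroup v1
# filesystem whose mount options include the cpuset controller.
# $1: optional path to a mount table in /proc/mounts format
has_cpuset_v1_mount() {
    awk '$3 == "cgroup" && $4 ~ /(^|,)cpuset(,|$)/ { found = 1 }
         END { exit !found }' "${1:-/proc/mounts}"
}
```

Running `has_cpuset_v1_mount && echo present || echo absent` on a host where numad fails and on one where it starts would show whether the two actually differ in this respect.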

#19

Updated by crameleon 12 months ago

I tracked the Salt root:root permission problem down to rsync and filed https://bugzilla.opensuse.org/show_bug.cgi?id=1226656, because I cannot figure it out despite trying different variations of --owner, --group, --super and --chown; neither the manual nor the changelog indicates anything obvious. Using rsync over SSH from Tumbleweed, the options still work fine. It is either specific to 15.6 or to the rsync:// protocol, but I did not test further.

#20

Updated by crameleon 12 months ago

The PAM issue is due to pam-config being invoked with --force during %post if /etc/pam.d/common-auth-pc is missing.

This needs to be corrected on these machines before the upgrade:

root@witch1 ~# salt --out-file=/dev/shm/auth-pc --out=text \*.infra.opensuse.org file.file_exists /etc/pam.d/common-auth-pc
root@witch1 ~# grep False /dev/shm/auth-pc
osc-collab.infra.opensuse.org: False
falkor22.infra.opensuse.org: False
ipx-narwal1.infra.opensuse.org: False
ipx-proxy1.infra.opensuse.org: False
nala.infra.opensuse.org: False
mirrorcache-us.infra.opensuse.org: False
narwal4.infra.opensuse.org: False
nala2.infra.opensuse.org: False
status2.infra.opensuse.org: False
mx4.infra.opensuse.org: False
login3.infra.opensuse.org: False
mirrorcache-us-db.infra.opensuse.org: False
provo-mirror.infra.opensuse.org: False
mx3.infra.opensuse.org: False
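
The same check can be reduced to bare minion IDs, which are easier to feed into follow-up commands than the grep output. This is a sketch only: the function name is hypothetical, and it assumes Salt's --out=text format of one "minion-id: value" pair per line, as shown above.

```shell
#!/bin/sh
# Read Salt --out=text output on stdin and print only the minion IDs whose
# returned value is exactly "False".
minions_missing_file() {
    awk -F': ' '$2 == "False" { print $1 }'
}
```

Usage would look like `salt --out=text '*.infra.opensuse.org' file.file_exists /etc/pam.d/common-auth-pc | minions_missing_file`, skipping the intermediate file in /dev/shm.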

#21

Updated by crameleon 12 months ago · Edited

Done:

provo-proxy1
provo-ns1
atlas1
atlas2
hel1
hel2

#22

Updated by crameleon 12 months ago

monitor done (incl. lots of cleanup of old packages and repair of a Prometheus alert rule parsing error on kernel version changes).

#23

Updated by crameleon 12 months ago · Edited

% Done changed from 10 to 20

Done:

prg-ns1
prg-ns2
mx1
mx2
mx-test

mx* needed removal of clamav from openSUSE:infrastructure (version in the distribution is now new enough), and a patch for mtail (for some reason, an additional system call is needed - since the mtail version did not change, maybe something in the default systemd syscall sets changed?): https://build.opensuse.org/request/show/1182631.

Same numad failure as earlier, but oddly only on mx2 - on mx1, numad started fine with the same version.

#24

Updated by crameleon 12 months ago

% Done changed from 20 to 30

Done:

narwal{4,5,6,7,8}
ipx-narwal1
water{,3,4}
tsp
paste
mx{3,4}
svn
rpmlint
qsc-ns3
progressoo
calendar
netbox1
slimhat
pinot
opi-proxy
stonehat
status3

stonehat was a bit interesting: apparently the management address relies on a libvirt network which starts automatically, but only has its virtual interface created once at least one VM using the network is started. libvirt-guests seems not to have resumed the previously running VMs, which required console intervention (interesting too, since no passphrase was recorded in the store - I corrected this now). In any case, the machine is rather poorly configured, so this is probably not an upgrade issue (https://progress.opensuse.org/issues/151453).

#25

Updated by crameleon 11 months ago

Filed a bug for the stonehat libvirt issue: https://bugzilla.opensuse.org/show_bug.cgi?id=1228073.
There is another problem on stonehat: every few days it stops routing any network packets (it still has the correct addresses and routes configured, but all network activity is broken and pinging anywhere fails), requiring a reboot to work again (just restarting network.service does not help).

#26

Updated by crameleon 10 months ago

falkor2{0,2} upgraded as well, because the version mismatch with the already upgraded falkor21 started causing issues. I added the 25_bli patch, and we need to be careful to remount /kvm upon reboots. The ARP problem could probably be mitigated by switching the NFS connection to IPv6 (which I have wanted to do for some time anyway, since it is the only remaining legacy IP connectivity on the clusters).

#27

Updated by crameleon 10 months ago

% Done changed from 30 to 60

Done:

community2
matomo
limesurvey
odin
minio
metrics
backup
nala
login3
acme
mybackup

#28

Updated by crameleon 10 months ago

Done:

nala2
kubic
lnt
ipx-proxy1
elections2

#29

Updated by crameleon 10 months ago

Done:

mirrordb{1,2}

#30

Updated by crameleon 10 months ago

Done:

galera{1,2,3}

.. including a large cleanup of packages.

#31

Updated by crameleon 10 months ago · Edited

% Done changed from 60 to 70

Done:

obsreview

no HTTP service after upgrade: https://progress.opensuse.org/issues/165425.

#32

Updated by crameleon 10 months ago

Done:

pagure01

Services failed with import errors after the upgrade, due to Leap 15.6 getting a new pygit2 version which is affected by https://github.com/libgit2/pygit2/commit/a8b2421bea55029296cc79ac7c1518b9885d8a6f. Hotpatched for now, and submitted the already existing upstream patch (pagure.git@8a1a7ba9f789ba446bab63783f7b963246861cb8) to our package: https://build.opensuse.org/request/show/1194456.

#33

Updated by crameleon 10 months ago

Done:

provo-mirror

.. also including a large package cleanup.

#34

Updated by crameleon 10 months ago

Done:

mirrorcache-us-db.i.o.o

Aborted mirrorcache-us.i.o.o due to

Detected 8 file conflicts:

File /usr/lib/perl5/vendor_perl/5.26.1/Time/CTime.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/lib/perl5/vendor_perl/5.26.1/Time/ParseDate.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/lib/perl5/vendor_perl/5.26.1/Time/Timezone.pm
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::CTime.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::DaysInMonth.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::JulianDay.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::ParseDate.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File /usr/share/man/man3/Time::Timezone.3pm.gz
  from install of
     perl-Time-modules-2013.0912-bp156.3.1.x86_64 (repo-oss)
  conflicts with file from install of
     perl-Time-ParseDate-2015.103-lp156.3.1.noarch (mirrorcache)

File conflicts happen when two packages attempt to install files with the same name but different contents. If you continue, conflicting files will be replaced losing the previous content.
Continue? [yes/no] (no): no

Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
History:
 - ABORT request:

Please see the above error message for a hint.

@andriinikitin ^ can the problematic packages from repo-oss be removed?
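
Since all eight conflicts above involve the same two packages, a summary by package pair is quicker to read than the per-file listing. The following is a sketch with a hypothetical function name; it assumes the exact "from install of" / "conflicts with file from install of" layout shown above and does not handle other forms of zypper conflict messages.

```shell
#!/bin/sh
# Read a zypper file-conflict report on stdin and print each distinct pair
# of conflicting packages once, as "<package A> <-> <package B>".
summarize_file_conflicts() {
    awk '
        # The line after "from install of" names the first package.
        /from install of$/ && !/conflicts with/ { want = "a"; next }
        # The line after "conflicts with file from install of" names the second.
        /conflicts with file from install of$/  { want = "b"; next }
        want == "a" { first = $1; want = ""; next }
        want == "b" { print first " <-> " $1; want = "" }
    ' | sort -u
}
```

Piping the saved zypper output through `summarize_file_conflicts` would collapse the report above into a single line naming perl-Time-modules and perl-Time-ParseDate.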

#35

Updated by crameleon 10 months ago

Related to tickets #165425: obs-reviewlab.o.o down after upgrade added

#36

Updated by crameleon 10 months ago · Edited

% Done changed from 70 to 80

Done:

kani-test

#37

Updated by andriinikitin 9 months ago

crameleon wrote in #note-34:

@andriinikitin ^ can the problematic packages from repo-oss be removed?

Hi, sorry for the delay, not sure what is wrong with my notifications.

Yes, I have checked and pontifex has only perl-Time-modules so perl-Time-ParseDate can be removed.
Let me know if I should do it.

#38

Updated by crameleon 9 months ago · Edited

Thanks for checking!

Done:

mirrorcache-us

#39

Updated by crameleon 9 months ago · Edited

% Done changed from 80 to 90

Done:

status1

MariaDB would time out upon starting as the schema upgrade took a long time. The default unit has TimeoutSec=300. I first tried with TimeoutStartSec=600, but it was not enough. Eventually, TimeoutStartSec=7200 gave it enough time.
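
One way to apply such an override without touching the packaged unit is a systemd drop-in. The sketch below is an illustration only: the function name, the drop-in file name timeout.conf and the unit name mariadb.service are assumptions, and the comment above does not say how the setting was actually applied.

```shell
#!/bin/sh
# Write a drop-in that overrides only TimeoutStartSec for a unit.
# $1: drop-in directory, for example /etc/systemd/system/mariadb.service.d
#     (the unit name is an assumption)
# $2: start timeout in seconds
write_start_timeout_dropin() {
    dropin_dir=$1
    seconds=$2
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nTimeoutStartSec=%s\n' "$seconds" > "$dropin_dir/timeout.conf"
}
```

After `write_start_timeout_dropin /etc/systemd/system/mariadb.service.d 7200`, a `systemctl daemon-reload` is needed before the next start picks up the new value. The drop-in can be deleted again once the schema upgrade has completed.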

#40

Updated by crameleon 9 months ago

Status changed from In Progress to Blocked

Pending:

dale => delegated to @hennevogel
riesling => delegated to @cboltz

#41

Updated by crameleon 5 months ago

Status changed from Blocked to In Progress
% Done changed from 90 to 100

#42

Updated by crameleon 5 months ago

Status changed from In Progress to Resolved
