action #90170

closed

Service for purging old kernels might run while system management is locked and fail

Added by okurz about 3 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2021-03-16
Due date:
2021-07-06
% Done:

0%

Estimated time:

Description

Observation

Observed today on openqaworker3 while checking the reason for the failed systemd services alert:

martchus@openqaworker3:/srv/salt> sudo systemctl status purge-kernels.service
purge-kernels.service - Purge old kernels
   Loaded: loaded (/usr/lib/systemd/system/purge-kernels.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2021-03-15 06:51:54 CET; 1 day 4h ago
 Main PID: 1166 (code=exited, status=7)

Mar 15 06:51:36 openqaworker3 systemd[1]: Starting Purge old kernels...
Mar 15 06:51:54 openqaworker3 zypper[1166]: System management is locked by the application with pid 1286 (zypper).
Mar 15 06:51:54 openqaworker3 zypper[1166]: Close this application before trying again.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Main process exited, code=exited, status=7/NOTRUNNING
Mar 15 06:51:54 openqaworker3 systemd[1]: Failed to start Purge old kernels.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Unit entered failed state.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Failed with result 'exit-code'.

Of course it helps to simply restart the service. Not sure how we could further improve this. This is likely just a caveat of openSUSE's purge-kernels-service package which provides that service (which likely comes from https://github.com/openSUSE/mkinitrd).

Acceptance criteria

  • AC1: The systemd service purge-kernels.service does not fail if zypper is running for a short time

Suggestion

Report this issue upstream as a bug and in the meantime apply a workaround for us, e.g. a systemd service override with retry.

Actions #2

Updated by okurz about 3 years ago

  • Tags set to alert, systemd, service, zypper, infrastructure
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Low to Normal
  • Target version set to Ready
Actions #3

Updated by mkittler about 3 years ago

I've created an upstream issue: https://github.com/openSUSE/mkinitrd/issues/43

Likely the best one can do is adding a retry. I suppose that would mean that the service would still be in the state "failed" between the retries so our monitoring would still trigger an alert.

Actions #4

Updated by okurz about 3 years ago

mkittler wrote:

I've created an upstream issue: https://github.com/openSUSE/mkinitrd/issues/43

https://github.com/openSUSE/mkinitrd/issues/43 was closed referring to bugzilla. So please report there.

Likely the best one can do is adding a retry. I suppose that would mean that the service would still be in the state "failed" between the retries so our monitoring would still trigger an alert.

Not if the alerting grace time is longer than the retry interval.

Actions #6

Updated by okurz about 3 years ago

Thank you for the bug report. Now we can implement a workaround for ourselves.

Actions #7

Updated by livdywan almost 3 years ago

We could check that zypper isn't in locked state:

           7 - ZYPPER_EXIT_ZYPP_LOCKED
           The ZYPP library is locked, e.g. packagekit is running.

Might be nicer than just Retry?
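
A rough sketch of what retrying only on the "locked" exit code could look like. The function name retry_while_locked and its arguments (maximum attempts, delay in seconds) are made up for illustration; only exit status 7 (ZYPPER_EXIT_ZYPP_LOCKED) comes from the zypper documentation quoted above.

```shell
#!/bin/sh
# retry_while_locked MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND only while it exits with status 7,
# which zypper uses for ZYPPER_EXIT_ZYPP_LOCKED. Any other status
# (success or a different error) is returned immediately.
retry_while_locked() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@"
        status=$?
        # Not locked: hand the result back to the caller unchanged.
        [ "$status" -ne 7 ] && return "$status"
        # Still locked after the last allowed attempt: give up with 7.
        [ "$attempt" -ge "$max_attempts" ] && return "$status"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Possible use (not executed here): up to 10 attempts, 30s apart.
# retry_while_locked 10 30 zypper -n purge-kernels
```

Unlike a blanket retry this would not hide other zypper errors, but the unit would still need an override to call such a wrapper instead of zypper directly.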

Actions #8

Updated by okurz almost 3 years ago

  • Priority changed from Normal to Low
Actions #9

Updated by okurz almost 3 years ago

  • Target version changed from Ready to future
Actions #10

Updated by livdywan almost 3 years ago

So apparently ZYPP_LOCK_TIMEOUT=-1 is now a supported option (see https://github.com/openSUSE/libzypp/pull/314) and boo#1184399 is fixed.

I don't know if there's a fix or bug for the purge kernel services yet 🤔️

Actions #11

Updated by livdywan almost 3 years ago

cdywan wrote:

I don't know if there's a fix or bug for the purge kernel services yet 🤔️

https://build.opensuse.org/request/show/891053

Actions #12

Updated by livdywan almost 3 years ago

  • Status changed from Workable to Feedback
  • Assignee set to livdywan

Since the package got updated in the meantime I went ahead and re-enabled it:

> zypper if purge-kernels-service
Repository 'Main Update Repository' is out-of-date. You can run 'zypper refresh' as root to update it.
Loading repository data...
Reading installed packages...


Information for package purge-kernels-service:
----------------------------------------------
Repository     : Main Update Repository
Name           : purge-kernels-service
Version        : 0-lp152.5.3.1
Arch           : noarch
Vendor         : openSUSE
Installed Size : 346 B
Installed      : Yes (automatically)
Status         : up-to-date
Source package : purge-kernels-service-0-lp152.5.3.1.src
Summary        : The service for removing old kernels when multiversion is enabled
Description    : 
    This service runs zypper purge-kernels on boot after a kernel package was
    installed.
> grep ZYPP_LOCK_TIMEOUT /usr/lib/systemd/system/purge-kernels.service
Environment=ZYPP_LOCK_TIMEOUT=-1
> sudo systemctl enable --now purge-kernels
Actions #13

Updated by okurz almost 3 years ago

  • Due date set to 2021-06-01
  • Target version changed from future to Ready

nice :+1:

As the problem did not happen that often before, I am not sure we will see it again soon by just waiting. I suggest the following: in one shell, start an interactive zypper process, e.g. zypper dup, which by default waits for your confirmation, and keep it running. In a second shell, start purge-kernels.service and check whether it waits for zypper. Then close the manual zypper process after about 30s and observe whether purge-kernels.service continues fine.

Actions #14

Updated by livdywan almost 3 years ago

So it seems that if zypper is locked I actually get an error:

sudo zypper dup &
sudo env ZYPP_LOCK_TIMEOUT=-1 zypper -n purge-kernels
System management is locked by the application with pid 2491 (zypper).
Close this application before trying again.

Maybe there's still a piece missing in the deployed libzypp? Even though the systemd unit has the variable specified. Will need to confirm that.

Actions #15

Updated by livdywan almost 3 years ago

cdywan wrote:

Maybe there's still a piece missing in the deployed libzypp? Even though the systemd unit has the variable specified. Will need to confirm that.

So we have libzypp-17.25.10-lp152.2.28.1.src, but ZYPP_LOCK_TIMEOUT will only be supported by the so far unreleased 17.25.11, which purge-kernels doesn't depend on.
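
For checking whether an installed libzypp is new enough, something like the sketch below might do. The helper name version_ge is made up for illustration, it assumes a sort that supports -V (e.g. GNU coreutils), and it only compares the upstream version in front of the first dash. The 17.25.11 threshold is the one named in this comment.

```shell
#!/bin/sh
# version_ge A B: succeeds if upstream version A is greater than or equal to B.
# Hypothetical helper; strips the release suffix (everything from the first
# "-") and relies on version sorting via "sort -V".
version_ge() {
    a=${1%%-*}
    b=${2%%-*}
    # With version sort, the smaller of the two versions comes out first.
    lowest=$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)
    [ "$lowest" = "$b" ]
}

# The version currently deployed according to this comment:
version_ge 17.25.10-lp152.2.28.1 17.25.11 || echo "libzypp older than 17.25.11"

# Possible use against the installed package (not executed here):
# version_ge "$(rpm -q --qf '%{VERSION}' libzypp)" 17.25.11
```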

Actions #16

Updated by livdywan almost 3 years ago

  • Status changed from Feedback to Blocked
Actions #17

Updated by nicksinger almost 3 years ago

recent update on the upstream issue:

Unfortunately libzypp-17.25.11 delayed, but I guess we're able to release it this week.
Actions #18

Updated by livdywan almost 3 years ago

  • Due date changed from 2021-06-01 to 2021-06-04

Bumping the due date as we're waiting for the new libzypp release

Actions #19

Updated by livdywan almost 3 years ago

  • Due date changed from 2021-06-04 to 2021-06-15

cdywan wrote:

Bumping the due date as we're waiting for the new libzypp release

Still waiting on the release. I can't do much at this point so let's give it another week.

Actions #20

Updated by okurz almost 3 years ago

https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c22 now states "An update that has four recommended fixes can now be installed.", which does not directly reference Leap. As Leap 15.2 uses independent update channels, I suggest you wait until the due date here has passed and then check whether there is another message in the bug report or whether the update has reached Leap. If not, check back whether someone needs to explicitly submit to openSUSE Leap 15.2 as well.

Actions #21

Updated by livdywan almost 3 years ago

  • Due date changed from 2021-06-15 to 2021-06-18

So we have libzypp-17.25.10-lp152.2.28.1.src, but ZYPP_LOCK_TIMEOUT will only be supported by the so far unreleased 17.25.11, which purge-kernels doesn't depend on.

We seem to have gotten an update:

> zypper if libzypp | grep Version
Version        : 17.25.10-lp152.2.31.1

But this is still not the new version we need.

Actions #22

Updated by livdywan almost 3 years ago

> zypper if libzypp | grep Version
Version        : 17.26.0-lp152.2.34.1

Still doesn't seem to unblock when zypper's done...

Actions #23

Updated by livdywan almost 3 years ago

  • Due date changed from 2021-06-18 to 2021-07-02

See above comment

Actions #24

Updated by livdywan almost 3 years ago

  • Due date changed from 2021-07-02 to 2021-07-09

The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1

So not much we can do here I guess besides waiting?

Actions #25

Updated by okurz almost 3 years ago

  • Due date changed from 2021-07-09 to 2021-07-06
  • Status changed from Blocked to Feedback

cdywan wrote:

The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1

So not much we can do here I guess besides waiting?

I am not sure what you expect. Previously https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c15 mentioned only SLE submissions. But then https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c24 already on 2021-06-16 stated "An update […] can now be installed." and you can take that literally as the point in time when updates are available to the end users. And the versions it mentions, the same ones you referenced above, are already newer than the ones you were previously waiting for.

I tested this myself within podman run --pull=always --rm -it registry.opensuse.org/opensuse/leap:15.2 by installing the corresponding packages and running zypper in htop in one shell, waiting at the "y/n/…" prompt without confirming the selection. Then in another shell I ran time env ZYPP_LOCK_TIMEOUT=-1 /usr/bin/zypper -n purge-kernels, waited two minutes (way past the original 30s timeout), then confirmed the prompt in the first shell, ending the interactive zypper process. In the second shell the "purge-kernels" call quickly continued and succeeded, real time overall 2m1s. So I consider this bug verification successful. I provided this text as bug verification in https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c25 as well.

And ssh openqaworker3 -- sudo rpm -q libzypp confirms that the fixed version is already installed: libzypp-17.26.0-lp152.2.34.1.x86_64

What did you do differently?

Actions #26

Updated by livdywan almost 3 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

cdywan wrote:

The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1

So not much we can do here I guess besides waiting?

I am not sure what you expect.

I expect that either I made a mistake in my manual test or there's something I missed when reading the bug. It seems like the former is the case, as you confirmed that it works now.

Actions #27

Updated by okurz over 1 year ago

  • Tags changed from alert, systemd, service, zypper, infrastructure to alert, systemd, service, zypper, infra