action #90170
closed
Service for purging old kernels might run while system management is locked and fail
0%
Description
Observation
Observed today on openqaworker3
while checking for the cause of the failed systemd services alert:
martchus@openqaworker3:/srv/salt> sudo systemctl status purge-kernels.service
purge-kernels.service - Purge old kernels
Loaded: loaded (/usr/lib/systemd/system/purge-kernels.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-03-15 06:51:54 CET; 1 day 4h ago
Main PID: 1166 (code=exited, status=7)
Mar 15 06:51:36 openqaworker3 systemd[1]: Starting Purge old kernels...
Mar 15 06:51:54 openqaworker3 zypper[1166]: System management is locked by the application with pid 1286 (zypper).
Mar 15 06:51:54 openqaworker3 zypper[1166]: Close this application before trying again.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Main process exited, code=exited, status=7/NOTRUNNING
Mar 15 06:51:54 openqaworker3 systemd[1]: Failed to start Purge old kernels.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Unit entered failed state.
Mar 15 06:51:54 openqaworker3 systemd[1]: purge-kernels.service: Failed with result 'exit-code'.
Simply restarting the service helps, of course. I am not sure how we could improve this further. This is likely just a caveat of openSUSE's purge-kernels-service
package, which provides that service (and likely comes from https://github.com/openSUSE/mkinitrd).
Acceptance criteria
- AC1: The systemd service purge-kernels.service does not fail if zypper is running for a short time
Suggestion
Report this issue upstream as a bug and in the meantime apply a workaround on our side, e.g. a systemd service override with retry.
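A minimal sketch of what such an override could look like, written as a shell helper that creates a systemd drop-in. The helper name `write_retry_dropin`, the file name `retry.conf` and the 30 second delay are assumptions for illustration only. Note also that some systemd versions refuse `Restart=` settings other than `no` for `Type=oneshot` units, so whether this works depends on the unit type and the systemd version in use.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that makes systemd retry the
# purge-kernels service after a failure.
# $1 is the base directory for unit overrides, e.g. /etc/systemd/system.
# The file name "retry.conf" and the 30s delay are illustrative assumptions.
write_retry_dropin() {
    dropin_dir=$1/purge-kernels.service.d
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/retry.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=30
EOF
}
```

To try it, one would call `write_retry_dropin /etc/systemd/system` as root and then run `systemctl daemon-reload` so that systemd picks up the drop-in.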
Updated by okurz almost 4 years ago
- Tags set to alert, systemd, service, zypper, infrastructure
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Low to Normal
- Target version set to Ready
Updated by mkittler over 3 years ago
I've created an upstream issue: https://github.com/openSUSE/mkinitrd/issues/43
Likely the best one can do is adding a retry. I suppose that would mean that the service would still be in the state "failed" between the retries so our monitoring would still trigger an alert.
Updated by okurz over 3 years ago
mkittler wrote:
I've created an upstream issue: https://github.com/openSUSE/mkinitrd/issues/43
https://github.com/openSUSE/mkinitrd/issues/43 was closed referring to bugzilla. So please report there.
Likely the best one can do is adding a retry. I suppose that would mean that the service would still be in the state "failed" between the retries so our monitoring would still trigger an alert.
Not if the alerting grace time is longer than the retry interval.
Updated by mkittler over 3 years ago
BugZilla ticket: https://bugzilla.opensuse.org/show_bug.cgi?id=1184399
Updated by okurz over 3 years ago
Thank you for the bug report. Now we can implement a workaround for ourselves.
Updated by livdywan over 3 years ago
We could check that zypper isn't in locked state:
7 - ZYPPER_EXIT_ZYPP_LOCKED
The ZYPP library is locked, e.g. packagekit is running.
Might be nicer than just Retry?
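As a rough sketch of that idea: a wrapper could retry only while the command exits with status 7 (ZYPPER_EXIT_ZYPP_LOCKED) and pass through any other result unchanged. The function name `retry_while_locked` and its arguments are hypothetical and not part of zypper or the purge-kernels package.

```shell
#!/bin/sh
# Hypothetical helper: run a command and retry it only while it exits
# with status 7 (ZYPPER_EXIT_ZYPP_LOCKED).
# Usage: retry_while_locked MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_while_locked() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            exit_code=$?
        fi
        # Give up on any error other than "locked", or once the
        # allowed number of attempts is used up.
        if [ "$exit_code" -ne 7 ] || [ "$attempt" -ge "$max_attempts" ]; then
            return "$exit_code"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry_while_locked 10 30 zypper -n purge-kernels` would try up to ten times with 30 seconds between attempts. Unlike a blanket retry, this does not mask unrelated failures.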
Updated by livdywan over 3 years ago
So apparently ZYPP_LOCK_TIMEOUT=-1
is now a supported option (see https://github.com/openSUSE/libzypp/pull/314) and boo#1184399 is fixed.
I don't know if there's a fix or bug for the purge kernel services yet 🤔️
Updated by livdywan over 3 years ago
cdywan wrote:
I don't know if there's a fix or bug for the purge kernel services yet 🤔️
Updated by livdywan over 3 years ago
- Status changed from Workable to Feedback
- Assignee set to livdywan
Since the package got updated in the meantime I went ahead and re-enabled it:
> zypper if purge-kernels-service
Repository 'Main Update Repository' is out-of-date. You can run 'zypper refresh' as root to update it.
Loading repository data...
Reading installed packages...
Information for package purge-kernels-service:
----------------------------------------------
Repository : Main Update Repository
Name : purge-kernels-service
Version : 0-lp152.5.3.1
Arch : noarch
Vendor : openSUSE
Installed Size : 346 B
Installed : Yes (automatically)
Status : up-to-date
Source package : purge-kernels-service-0-lp152.5.3.1.src
Summary : The service for removing old kernels when multiversion is enabled
Description :
This service runs zypper purge-kernels on boot after a kernel package was
installed.
> grep ZYPP_LOCK_TIMEOUT /usr/lib/systemd/system/purge-kernels.service
Environment=ZYPP_LOCK_TIMEOUT=-1
> sudo systemctl enable --now purge-kernels
Updated by okurz over 3 years ago
- Due date set to 2021-06-01
- Target version changed from future to Ready
nice :+1:
As the problem did not happen that often before, I am not sure we will see it again soon by just waiting. I suggest you do the following: In one shell start an interactive zypper process, e.g. zypper dup
which by default will wait for your confirmation, keep it running. In a second shell start the service purge-kernels.service
and see if it waits for zypper, then close the manual zypper process after about 30s and observe if the purge-kernels.service continues fine.
Updated by livdywan over 3 years ago
So it seems that if zypper is locked I actually get an error:
sudo zypper dup &
sudo env ZYPP_LOCK_TIMEOUT=-1 zypper -n purge-kernels
System management is locked by the application with pid 2491 (zypper).
Close this application before trying again.
Maybe there's still a piece missing in the deployed libzypp? Even though the systemd unit has the variable specified. Will need to confirm that.
Updated by livdywan over 3 years ago
cdywan wrote:
Maybe there's still a piece missing in the deployed libzypp? Even though the systemd unit has the variable specified. Will need to confirm that.
So we have libzypp-17.25.10-lp152.2.28.1.src, but ZYPP_LOCK_TIMEOUT will only be supported by the so far unreleased 17.25.11, which purge-kernels doesn't depend on.
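A quick way to check whether a given libzypp version is at least 17.25.11 is sketched below. It assumes GNU `sort -V` is available and only compares the upstream version part, which is a simplification compared to full RPM version comparison. The function name `version_ge` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is greater than or equal
# to version $2, using the version ordering of GNU sort (-V).
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example with the version mentioned above:
if version_ge 17.25.10 17.25.11; then
    echo "ZYPP_LOCK_TIMEOUT should be supported"
else
    echo "libzypp is older than 17.25.11"
fi
```

The installed upstream version could be fed in with something like `rpm -q --qf '%{VERSION}' libzypp`.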
Updated by nicksinger over 3 years ago
recent update on the upstream issue:
Unfortunately libzypp-17.25.11 delayed, but I guess we're able to release it this week.
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-01 to 2021-06-04
Bumping the due date as we're waiting for the new libzypp release
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-04 to 2021-06-15
cdywan wrote:
Bumping the due date as we're waiting for the new libzypp release
Still waiting on the release. I can't do much at this point so let's give it another week.
Updated by okurz over 3 years ago
https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c22 now states "An update that has four recommended fixes can now be installed.", which does not directly reference Leap. As Leap 15.2 uses independent update channels, I suggest you wait until the due date here has passed and then check whether there is another message in the bug report or whether the update has reached Leap. If not, check back whether someone needs to explicitly submit it to openSUSE Leap 15.2 as well.
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-15 to 2021-06-18
So we have libzypp-17.25.10-lp152.2.28.1.src, but ZYPP_LOCK_TIMEOUT will only be supported by the so far unreleased 17.25.11, which purge-kernels doesn't depend on.
We seem to have gotten an update:
> zypper if libzypp | grep Version
Version : 17.25.10-lp152.2.31.1
But this is still not the new version we need.
Updated by livdywan over 3 years ago
> zypper if libzypp | grep Version
Version : 17.26.0-lp152.2.34.1
Still doesn't seem to unblock when zypper's done...
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-18 to 2021-07-02
See above comment
Updated by livdywan over 3 years ago
- Due date changed from 2021-07-02 to 2021-07-09
The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1
So not much we can do here I guess besides waiting?
Updated by okurz over 3 years ago
- Due date changed from 2021-07-09 to 2021-07-06
- Status changed from Blocked to Feedback
cdywan wrote:
The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1
So not much we can do here I guess besides waiting?
I am not sure what you expect. Previously https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c15 mentioned only SLE submissions. But then https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c24, already on 2021-06-16, stated "An update […] can now be installed.", and you can take that literally as the point in time when updates are available to end users. And the versions it mentions (the same ones you referenced above) are already newer than the ones you were previously waiting for. I tested this myself within podman run --pull=always --rm -it registry.opensuse.org/opensuse/leap:15.2
by installing the corresponding packages and running zypper in htop
in one shell, waiting at the "y/n/…" prompt without confirming the selection. Then in another shell I ran time env ZYPP_LOCK_TIMEOUT=-1 /usr/bin/zypper -n purge-kernels
and waited two minutes (well past the original 30s timeout) before confirming the prompt in the first shell, which ended the interactive zypper process. In the second shell the "purge-kernels" call then quickly continued and succeeded, with a real time of 2m1s overall. So I consider this bug verification successful. I also provided this text as bug verification in https://bugzilla.opensuse.org/show_bug.cgi?id=1184399#c25.
And ssh openqaworker3 -- sudo rpm -q libzypp
confirms that the fixed version is already installed: libzypp-17.26.0-lp152.2.34.1.x86_64
What did you do differently?
Updated by livdywan over 3 years ago
- Status changed from Feedback to Resolved
okurz wrote:
cdywan wrote:
The last comment on the bug was and still is:
openSUSE Leap 15.2 (src): libsolv-0.7.19-lp152.2.25.1, libzypp-17.26.0-lp152.2.34.1, zypper-1.14.45-lp152.2.24.1
So not much we can do here I guess besides waiting?
I am not sure what you expect.
I expect that either I made a mistake in my manual test or there's something I missed when reading the bug. It seems like the former is the case, as you confirmed that it works now.
Updated by okurz about 2 years ago
- Tags changed from alert, systemd, service, zypper, infrastructure to alert, systemd, service, zypper, infra