action #117262
closed [alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" size:M
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services failed today. ca-certificates on osd shows:
Sep 27 07:18:52 openqa systemd[1]: Starting Update system wide CA certificates...
Sep 27 07:18:53 openqa update-ca-certificates[7397]: p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17
Sep 27 07:18:53 openqa systemd[1]: ca-certificates.service: Main process exited, code=exited, status=1/FAILURE
Sep 27 07:18:53 openqa systemd[1]: ca-certificates.service: Failed with result 'exit-code'.
Sep 27 07:18:53 openqa systemd[1]: Failed to start Update system wide CA certificates.
A simple restart of the service fixed that.
Suggestions
- DONE Research if we can find something about this error -> observed it again in https://progress.opensuse.org/issues/131096
- DONE Look into the system log around this time if there was any other related error -> no other error
- Might be an upstream bug. If no report exists yet, report it
- Take a look into /usr/lib/ca-certificates/update.d/99certbundle.run , like https://github.com/openSUSE/ca-certificates/blob/master/certbundle.run#L37, might be easy to provide an upstream pull request to fix it in https://github.com/openSUSE/ca-certificates , package is https://build.opensuse.org/package/show/openSUSE:Factory/ca-certificates
Updated by okurz about 2 years ago
- Related to action #104172: osd service ca-certificates failed with "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: File exists" added
Updated by okurz about 2 years ago
- Status changed from In Progress to Resolved
During web research I haven't found anything useful other than our own older ticket #104172, which I linked. There was nothing useful in the system log either.
Updated by nicksinger about 2 years ago
The only stuff I could find are two occurrences in the p11-kit source code:
https://github.com/p11-glue/p11-kit/blob/7b1ef9e559e7f7bb2c743abed7688b621cda9f88/trust/save.c#L206-L211
and
https://github.com/p11-glue/p11-kit/blob/7b1ef9e559e7f7bb2c743abed7688b621cda9f88/trust/save.c#L224-L229
As the second one does not pass an errno, I'd suspect the first one fails here. Interestingly enough, the code for printing the message should print the errno resolved to a human-readable name:
https://github.com/p11-glue/p11-kit/blob/34b568727ff98ebb36f45a3d63c07f165c58219b/common/message.c#L124 (do we miss a proper locale on OSD? Or is it just broken in p11-kit's environment?)
Anyhow, 17 belongs to "EEXIST" (errno -l on OSD maps the codes to their names), which could point to left-overs after the recent crashes we suffered.
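For reference, a minimal sketch of what an EEXIST failure looks like when the error text is resolved properly. It runs in a scratch directory with made-up file names and uses ln only as a convenient way to trigger errno 17. Whether p11-kit hits the same call at this point is an assumption, not something confirmed in this ticket.

```shell
#!/bin/sh
# Illustration only, run in a scratch directory (not /var/lib/ca-certificates).
workdir=$(mktemp -d)
touch "$workdir/new-bundle" "$workdir/ca-bundle.pem.tmp"

# Hard-linking onto a name that already exists fails with EEXIST (errno 17 on Linux).
if link_error=$(LC_ALL=C ln "$workdir/new-bundle" "$workdir/ca-bundle.pem.tmp" 2>&1); then
    link_status=0
else
    link_status=$?
fi
echo "ln exit status $link_status: $link_error"
rm -rf "$workdir"
```

The message ends in "File exists" here, which is the text that p11-kit printed as "Unknown error 17" on OSD.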
Updated by nicksinger about 2 years ago
I found https://bugzilla.suse.com/show_bug.cgi?id=1100241 which mentions that ca-certificates.service should be disabled on "normal" installations (which is indeed the case on OSD) and found that there is a ca-certificates.path unit triggering the service.
This .path unit monitors several places where "manual" certificates can be deployed and takes care of automatically calling update-ca-certificates when a certificate is deployed there. All other certificates which are shipped by packages should call update-ca-certificates in their %post hook. I followed this clue and found two certificates which are monitored by this path unit on OSD:
/usr/share/pki/trust/ca-certificates-mozilla.trust.p11-kit
/usr/share/pki/trust/anchors/SUSE_Trust_Root.crt.pem
They belong to the packages ca-certificates-mozilla and ca-certificates-suse, with mozilla coming from the SLE15 update repo and suse from the SUSE_CA repo. So one hypothesis is a race condition between the path unit and the %post hook of one of the two packages.
Looking at the journal of ca-certificates.path shows that previously something stopped this watch:
-- Boot 58ce37dfcd7b43578ebac8c0ca8ee2a3 --
Sep 21 17:41:13 openqa systemd[1]: Started Watch for changes in CA certificates.
Sep 25 03:30:15 openqa systemd[1]: ca-certificates.path: Deactivated successfully.
Sep 25 03:30:16 openqa systemd[1]: Stopped Watch for changes in CA certificates.
-- Boot 3a007cbe2d914beeaa138da98e3606c5 --
Sep 25 03:30:56 openqa systemd[1]: Started Watch for changes in CA certificates.
-- Boot 37b8d07bd19743f5b73de54f2d8baa4f --
Sep 26 16:27:48 openqa systemd[1]: Started Watch for changes in CA certificates.
Sep 26 16:51:31 openqa systemd[1]: ca-certificates.path: Deactivated successfully.
Sep 26 16:51:31 openqa systemd[1]: Stopped Watch for changes in CA certificates.
-- Boot 0e3e2adc06df4ad98653780f2955335e --
Sep 26 16:52:16 openqa systemd[1]: Started Watch for changes in CA certificates.
but not since the last boot. I think this is why we see this sporadically.
Possible workarounds/solutions:
- make sure ca-certificates.path is disabled
  - downside: certificates in the mentioned paths need to be manually added by calling update-ca-certificates
  - IIUC ca-certificates-suse already behaves properly and updates in %post: https://build.suse.de/package/view_file/SUSE:CA/ca-certificates-suse/ca-certificates-suse.spec
  - I didn't check the spec of ca-certificates-mozilla
- figure out why the two mentioned packages write into that location and not like other certificates (which one, actually?) into the "proper" location
Updated by okurz about 2 years ago
- Status changed from Resolved to New
- Assignee deleted (okurz)
With the additional information we can work on the mentioned suggestions to improve the situation and prevent further problems.
Updated by okurz over 1 year ago
- Related to action #131096: [alert] Service `ca-certificates` can fail size:M added
Updated by okurz over 1 year ago
- Subject changed from [alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" to [alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Due date deleted (2023-07-06)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
As discussed in the weekly: unassigning and leaving this for others to do. The next suggestion still holds: just report an upstream issue and see if anybody has an idea.
Updated by mkittler over 1 year ago
I'm not 100 % sure whether #131096 is a duplicate (as the ticket description suggests). This ticket is about:
Sep 27 07:18:53 openqa update-ca-certificates[7397]: p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17
and the other ticket about:
Jun 18 03:01:49 schort-server update-ca-certificates[29527]: mv: cannot stat '/var/lib/ca-certificates/ca-bundle.pem.new': No such file or directory
Updated by mkittler over 1 year ago
- Assignee set to mkittler
There are no problematic scripts in /etc/ca-certificates/update.d on those hosts.
The failing mv might be in /usr/lib/ca-certificates/update.d/99certbundle.run, which I had a look at on schort-server:
set -e
cafile="/var/lib/ca-certificates/ca-bundle.pem"
cadir="/var/lib/ca-certificates/pem"
…
trust extract --format=pem-bundle --purpose=server-auth --filter=ca-anchors $cafile.tmp
cat $cafile.tmp >> $cafile.new
rm -f $cafile.tmp
mv "$cafile.new" "$cafile"
The other scripts in that directory don't have a mv command that would produce this error message. However, I'm also still wondering how it can happen in 99certbundle.run because it looks like the script will either fail earlier or succeed. Maybe something else did something to the file in the background (like a 2nd instance of that script running in parallel)? That would be in line with the hypothesis @nsinger stated in #117262#note-4.
About the p11-kit issue: I'm not even sure when that would happen. I didn't find an invocation in any of the scripts and also nothing in the ca-certificates Git repo except comments/documentation.
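To make the "2nd instance in parallel" hypothesis concrete, here is a hedged sketch that replays one possible interleaving of two script runs by hand. It works in a scratch directory, the echo lines stand in for the real "trust extract" and "cat" steps, and the chosen ordering is an assumption for illustration, not something observed on the hosts.

```shell
#!/bin/sh
# Hypothetical replay of two interleaved runs of the mv step; scratch directory only.
workdir=$(mktemp -d)
cafile="$workdir/ca-bundle.pem"

# Run A and run B both reach the step that appends to the shared "$cafile.new".
echo "bundle from run A" >> "$cafile.new"
echo "bundle from run B" >> "$cafile.new"

# Run B finishes first and moves the shared file into place.
mv "$cafile.new" "$cafile"

# Run A now executes the same mv, but the source file is already gone.
if mv_error=$(LC_ALL=C mv "$cafile.new" "$cafile" 2>&1); then
    mv_status=0
else
    mv_status=$?
fi
echo "run A: mv exit status $mv_status: $mv_error"
rm -rf "$workdir"
```

With set -e in effect, as in the real script, the failing mv in run A would abort the script and make the service fail, which matches the kind of message seen on schort-server.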
Updated by mkittler over 1 year ago
- Status changed from Workable to Feedback
I haven't found an upstream bug so I've just created a new one: https://github.com/openSUSE/ca-certificates/issues/20
Not sure whether it makes sense to try to fix this on our own. At least the p11-kit part is a bit strange to me and maybe upstream has a better idea how to fix this.
Updated by okurz over 1 year ago
As discussed in the weekly: We have the upstream report. Now we should implement a workaround: just implement a restart in the systemd service. Likely something like systemctl edit ca-certificates and then add a ticket reference and a Restart=on-failure or something along those lines. Should be easy to do in salt based on the "override.conf" examples we already have.
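A minimal sketch of what such a drop-in could look like, written to a scratch directory so it can be tried without touching a real host. On a real system, systemctl edit ca-certificates would place the file at /etc/systemd/system/ca-certificates.service.d/override.conf. The exact content of the salt change is not shown in this ticket, so the file body below is an assumption. Also note that older systemd versions reject Restart=on-failure for Type=oneshot units, so this assumes a systemd version that permits it.

```shell
#!/bin/sh
# Sketch only: writes the drop-in below a scratch directory instead of /etc/systemd/system.
dropin_dir="$(mktemp -d)/ca-certificates.service.d"
mkdir -p "$dropin_dir"

cat > "$dropin_dir/override.conf" <<'EOF'
# Workaround for https://progress.opensuse.org/issues/117262
[Service]
Restart=on-failure
EOF

cat "$dropin_dir/override.conf"
```

After deploying such a file to the real location, systemctl daemon-reload is needed for systemd to pick it up.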
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/927 merged. Given that we have not received any feedback in the upstream issue and the issue is unlikely to impact us further any time soon, I will now resolve this.