action #117262

closed

[alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" size:M

Added by okurz about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2022-09-27
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services reported a failed service today. ca-certificates on OSD shows:

Sep 27 07:18:52 openqa systemd[1]: Starting Update system wide CA certificates...
Sep 27 07:18:53 openqa update-ca-certificates[7397]: p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17
Sep 27 07:18:53 openqa systemd[1]: ca-certificates.service: Main process exited, code=exited, status=1/FAILURE
Sep 27 07:18:53 openqa systemd[1]: ca-certificates.service: Failed with result 'exit-code'.
Sep 27 07:18:53 openqa systemd[1]: Failed to start Update system wide CA certificates.

A simple restart of the service fixed that.

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #104172: osd service ca-certificates failed with "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: File exists" (Resolved, okurz, 2021-12-20)

Related to openQA Infrastructure (public) - action #131096: [alert] Service `ca-certificates` can fail size:M (Rejected, okurz, 2023-06-19)

Actions #1

Updated by okurz about 2 years ago

  • Related to action #104172: osd service ca-certificates failed with "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: File exists" added
Actions #2

Updated by okurz about 2 years ago

  • Status changed from In Progress to Resolved

During web research I haven't found anything useful other than our own older ticket #104172, which I linked. There is also nothing useful in the system log.

Actions #3

Updated by nicksinger about 2 years ago

The only stuff I could find are two occurrences in the p11-kit source code:
https://github.com/p11-glue/p11-kit/blob/7b1ef9e559e7f7bb2c743abed7688b621cda9f88/trust/save.c#L206-L211
and
https://github.com/p11-glue/p11-kit/blob/7b1ef9e559e7f7bb2c743abed7688b621cda9f88/trust/save.c#L224-L229

As the second one does not pass an errno, I suspect the first one is the one failing here. Interestingly, the code that prints the message should resolve the errno to a human-readable name:
https://github.com/p11-glue/p11-kit/blob/34b568727ff98ebb36f45a3d63c07f165c58219b/common/message.c#L124 (do we miss a proper locale on OSD? Or is it just broken in p11-kit's environment?)

Anyhow, 17 belongs to "EEXIST" (errno -l on OSD maps the codes to their names) which could point to left-overs after the recent crashes we suffered.
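To make the EEXIST reading concrete, here is a minimal sketch that provokes the same error class with a leftover file. It does not reproduce what p11-kit does internally; `demo_eexist` and the scratch files are hypothetical, and `ln` merely stands in for any call that refuses to replace an existing path.

```shell
#!/bin/sh
# Illustration only: provoke EEXIST (errno 17) with a leftover temp file.
# demo_eexist is a hypothetical helper, not part of ca-certificates or p11-kit.
demo_eexist() {
    workdir=$(mktemp -d) || return 1
    # Simulate a stale file left behind by an interrupted earlier run.
    : > "$workdir/ca-bundle.pem.tmp"
    : > "$workdir/fresh"
    # link(2) refuses to replace an existing path and fails with EEXIST;
    # LC_ALL=C forces the untranslated error text.
    LC_ALL=C ln "$workdir/fresh" "$workdir/ca-bundle.pem.tmp" 2>&1
    status=$?
    rm -rf "$workdir"
    return "$status"
}

demo_eexist || echo "ln exited with status $?"
```

With a working strerror lookup the message ends in "File exists", which is the text seen in #104172; "Unknown error 17" would then be the same errno without the name resolved.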

Actions #4

Updated by nicksinger about 2 years ago

I found https://bugzilla.suse.com/show_bug.cgi?id=1100241 which mentioned that the ca-certificates.service should be disabled on "normal" installations (which is indeed the case on OSD) and found that there is ca-certificates.path triggering the service.
This .path unit monitors several places where "manual" certificates can be deployed and takes care of automatically calling update-ca-certificates if done so. All other certificates which are shipped by packages should call update-ca-certificates in their %post hook. I followed this clue and found two certificates which are monitored by this path-unit on OSD:

/usr/share/pki/trust/ca-certificates-mozila.trust.p11-kit
/usr/share/pki/trust/anchors/SUSE_Trust_Root.crt.pem

belonging to the packages ca-certificates-mozilla and ca-certificates-suse. The former comes from the SLE15 update repo, the latter from the SUSE_CA repo. So one hypothesis is a race condition between the path unit and the %post hook of one of the two packages.
Looking at the journal of ca-certificates.path shows that previously something stopped this watch:

-- Boot 58ce37dfcd7b43578ebac8c0ca8ee2a3 --
Sep 21 17:41:13 openqa systemd[1]: Started Watch for changes in CA certificates.
Sep 25 03:30:15 openqa systemd[1]: ca-certificates.path: Deactivated successfully.
Sep 25 03:30:16 openqa systemd[1]: Stopped Watch for changes in CA certificates.
-- Boot 3a007cbe2d914beeaa138da98e3606c5 --
Sep 25 03:30:56 openqa systemd[1]: Started Watch for changes in CA certificates.
-- Boot 37b8d07bd19743f5b73de54f2d8baa4f --
Sep 26 16:27:48 openqa systemd[1]: Started Watch for changes in CA certificates.
Sep 26 16:51:31 openqa systemd[1]: ca-certificates.path: Deactivated successfully.
Sep 26 16:51:31 openqa systemd[1]: Stopped Watch for changes in CA certificates.
-- Boot 0e3e2adc06df4ad98653780f2955335e --
Sep 26 16:52:16 openqa systemd[1]: Started Watch for changes in CA certificates.

but not since the last boot. I think this is why we see this sporadically.

Possible workarounds/solutions:

  1. make sure ca-certificates.path is disabled
  2. figure out why the two mentioned packages write into that location and not, like other certificates (which ones, actually?), into the "proper" location
Actions #5

Updated by okurz about 2 years ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)

With the additional information we can work on the mentioned suggestions to improve the situation and prevent further problems.

Actions #6

Updated by okurz about 2 years ago

  • Target version changed from Ready to future
Actions #7

Updated by okurz almost 2 years ago

  • Tags set to infra
Actions #8

Updated by okurz over 1 year ago

  • Related to action #131096: [alert] Service `ca-certificates` can fail size:M added
Actions #9

Updated by okurz over 1 year ago

  • Target version changed from future to Ready
Actions #10

Updated by okurz over 1 year ago

  • Subject changed from [alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" to [alert] failed systemd service: ca-certificates on openqa.suse.de, "p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by nicksinger over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #12

Updated by openqa_review over 1 year ago

  • Due date set to 2023-07-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz over 1 year ago

  • Due date deleted (2023-07-06)
  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)

As discussed in the weekly, unassigning and leaving this for others to do. The next suggestion still holds: just report an upstream issue and see if anybody has an idea.

Actions #14

Updated by mkittler over 1 year ago

I'm not 100 % sure whether #131096 is a duplicate (as the ticket description suggests). This ticket is about:

Sep 27 07:18:53 openqa update-ca-certificates[7397]: p11-kit: couldn't complete writing of file: /var/lib/ca-certificates/ca-bundle.pem.tmp: Unknown error 17

and the other ticket about:

Jun 18 03:01:49 schort-server update-ca-certificates[29527]: mv: cannot stat '/var/lib/ca-certificates/ca-bundle.pem.new': No such file or directory
Actions #15

Updated by mkittler over 1 year ago

  • Assignee set to mkittler

There are no problematic scripts in /etc/ca-certificates/update.d on those hosts.

The failing mv might be in /usr/lib/ca-certificates/update.d/99certbundle.run which I had a look at on schort-server:

set -e

cafile="/var/lib/ca-certificates/ca-bundle.pem"
cadir="/var/lib/ca-certificates/pem"
…
trust extract --format=pem-bundle --purpose=server-auth --filter=ca-anchors $cafile.tmp
cat $cafile.tmp >> $cafile.new
rm -f $cafile.tmp
mv "$cafile.new" "$cafile"

The other scripts in that directory don't have a mv command that would produce this error message. However, I'm also still wondering how it can happen in 99certbundle.run because it looks like the script will either fail earlier or succeed. Maybe something else did something to the file in the background (like a 2nd instance of that script running in parallel)? That would be in-line with the hypothesis @nsinger stated in #117262#note-4.
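The parallel-run hypothesis can be replayed by hand. The sketch below runs the last steps of the script excerpt for two overlapping runs in a scratch directory; the interleaving is an assumption chosen to trigger the error, and `demo_race` is a hypothetical helper, not code from ca-certificates.

```shell
#!/bin/sh
# Illustration only: replay the tail of 99certbundle.run for two overlapping
# runs, A and B, in a scratch directory. The interleaving is an assumption.
demo_race() {
    workdir=$(mktemp -d) || return 1
    cafile="$workdir/ca-bundle.pem"

    # Both runs have appended their extracted bundle to the shared .new file.
    echo "bundle from run A" >> "$cafile.new"
    echo "bundle from run B" >> "$cafile.new"

    # Run A finishes first and moves the shared file into place.
    mv "$cafile.new" "$cafile"

    # Run B reaches the same step, but the file is already gone.
    LC_ALL=C mv "$cafile.new" "$cafile" 2>&1
    status=$?
    rm -rf "$workdir"
    return "$status"
}

demo_race || echo "second mv exited with status $?"
```

If this is what happens on the hosts, serializing the runs (for example with flock(1) around the script) would avoid the interleaving; that is a possible mitigation, not something verified here.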

About the p11-kit issue: I'm not even sure when that would happen. I didn't find an invocation in any of the scripts and also nothing in the ca-certificates Git repo except comments/documentation.

Actions #16

Updated by mkittler over 1 year ago

  • Status changed from Workable to Feedback

I haven't found an upstream bug so I've just created a new one: https://github.com/openSUSE/ca-certificates/issues/20

Not sure whether it makes sense to try to create this on our own. At least the p11-kit part is a bit strange to me and maybe upstream has a better idea how to fix this.

Actions #17

Updated by okurz over 1 year ago

As discussed in the weekly: we have the upstream report, so now we should implement a workaround by making the systemd service restart on failure. Likely something like

systemctl edit ca-certificates

and then add a ticket reference and a

Restart=on-failure

or something along those lines. This should be easy to do in salt based on the "override.conf" examples we already have.
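As a rough sketch of what such a drop-in could look like when written by hand rather than through salt: `write_override` is a hypothetical helper, and the content is an assumption based on the suggestion above, not the content of the change that was eventually merged.

```shell
#!/bin/sh
# Sketch only: write a drop-in that restarts ca-certificates.service on failure.
# write_override is a hypothetical helper, not taken from salt-states-openqa.
write_override() {
    # On a real host the directory would be
    # /etc/systemd/system/ca-certificates.service.d
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
# Workaround, see action #117262
[Service]
Restart=on-failure
EOF
}

# Dry run into a scratch directory so nothing on the system is touched.
scratch=$(mktemp -d)
write_override "$scratch/ca-certificates.service.d"
cat "$scratch/ca-certificates.service.d/override.conf"
rm -rf "$scratch"
```

After placing the file in the real location, `systemctl daemon-reload` is needed for systemd to pick it up; `systemctl edit ca-certificates` does both steps interactively.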

Actions #18

Updated by okurz over 1 year ago

  • Status changed from Feedback to In Progress
Actions #19

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/927 merged. Given that we have not received any feedback on the upstream issue and the issue is unlikely to impact us further soon, I will now resolve this.
