action #105960

closed

Dehydrated fails on OSD size:M

Added by mkittler almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-02-04
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Feb 04 10:55:19 openqa systemd[1]: Starting Certificate Update Runner for Dehydrated...
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Using main config file /etc/dehydrated/config
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Feb 04 10:55:19 openqa sudo[6052]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Feb 04 10:55:20 openqa dehydrated[6427]: {}
Feb 04 10:55:20 openqa systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 10:55:20 openqa systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Feb 04 10:55:20 openqa systemd[1]: Failed to start Certificate Update Runner for Dehydrated.

Acceptance criteria

  • AC1: OSD is automatically dehydrated on a regular schedule

Suggestions

Out of scope

  • qem dashboard (different ticket)

Related issues 1 (0 open, 1 closed)

Has duplicate openQA Infrastructure (public) - action #106035: [qe-tools] dehydrated service fails on osd (Rejected, okurz, 2022-02-07)

Actions #1

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler

This is the full error message:

martchus@openqa:~> sudo -u dehydrated dehydrated --cron
# INFO: Using main config file /etc/dehydrated/config
# INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Processing openqa.suse.de with alternative names: openqa.nue.suse.com
 + Checking domain name(s) of existing cert... unchanged.
 + Checking expire date of existing cert...
 + Valid till Feb 26 23:09:44 2022 GMT (Less than 23 days). Renewing!
 + Signing domains...
 + Generating signing request...
 + Requesting new certificate order from CA...
  + ERROR: An error occurred while sending post-request to https://ca-internal.suse.de/acme/acme/new-order (Status 500)

Details:
HTTP/2 500 
cache-control: no-store
content-type: application/json
link: <https://ca-internal.suse.de/acme/acme/directory>;rel="index"
replay-nonce: ZWRLTmM1VTBvUEpBZnM4QkN5UVNyUEZYYk5MOWc4OUc
content-length: 3
date: Fri, 04 Feb 2022 15:48:23 GMT

{}


/usr/bin/dehydrated: line 737: 1: unbound variable
Actions #2

Updated by mkittler almost 3 years ago

  • Assignee deleted (mkittler)

Let's talk about / estimate this ticket together next week.

Actions #3

Updated by okurz almost 3 years ago

I guess a retry in a systemd service override would suffice

Actions #4

Updated by jbaier_cz almost 3 years ago

I see the same error on qam2, so it seems something is broken on the CA side:

Feb 07 01:52:32 qam2 systemd[1]: Starting Certificate Update Runner for Dehydrated...
Feb 07 01:52:32 qam2 dehydrated[25822]: # INFO: Using main config file /etc/dehydrated/config
Feb 07 01:52:32 qam2 dehydrated[25822]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Feb 07 01:52:32 qam2 sudo[25822]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Feb 07 01:52:35 qam2 dehydrated[26142]: HTTP/2 500
Feb 07 01:52:35 qam2 dehydrated[26142]: cache-control: no-store
Feb 07 01:52:35 qam2 dehydrated[26142]: content-type: application/json
Feb 07 01:52:35 qam2 dehydrated[26142]: link: <https://ca-internal.suse.de/acme/acme/directory>;rel="index"
Feb 07 01:52:35 qam2 dehydrated[26142]: replay-nonce: RDc3am9aSmZoMDBpaHdPamo3WVp4U0RweXpnU3F3UU4
Feb 07 01:52:35 qam2 dehydrated[26142]: content-length: 3
Feb 07 01:52:35 qam2 dehydrated[26142]: date: Mon, 07 Feb 2022 00:52:35 GMT
Feb 07 01:52:35 qam2 dehydrated[26142]: 
Feb 07 01:52:35 qam2 dehydrated[26143]: {}
Feb 07 01:52:35 qam2 systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Feb 07 01:52:35 qam2 systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Feb 07 01:52:35 qam2 systemd[1]: Failed to start Certificate Update Runner for Dehydrated.

The service itself is started from a timer (at least on qam2) with OnCalendar=daily, so there is no need to add a retry to the service (it would just fail again).

Actions #5

Updated by okurz almost 3 years ago

  • Has duplicate action #106035: [qe-tools] dehydrated service fails on osd added
Actions #7

Updated by jbaier_cz almost 3 years ago

  • Status changed from New to Feedback
  • Assignee set to jbaier_cz

Reported as SD-76058

Actions #8

Updated by jbaier_cz almost 3 years ago

Based on the infra investigation and my own assumptions, a package upgrade on the STEP-CA side did something to its internal account database. This can be repaired by "recreating the account"; I found that it is sufficient to invoke dehydrated --account. After that, the renewal works. Apart from that, infra has no idea what went wrong (the server-side logs are apparently limited).

On qam2, I ran

dehydrated --account
systemctl start dehydrated.service

and it solved the problem.

Actions #10

Updated by mkittler almost 3 years ago

I moved the old account under /etc/dehydrated/accounts to a backup directory and created a new one via martchus@openqa:~> sudo -u dehydrated dehydrated --register --accept-terms. It now works again, and the certificate has indeed been renewed. (Without "disabling" the old account, dehydrated would not create a new one. The new account apparently also has the same hash/checksum/id as the old account.)
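
The manual fix described above could be sketched as a small shell function. This is only an illustration under assumptions: the backup location, the REGISTER_CMD override and the function name are hypothetical; only /etc/dehydrated/accounts and the register command come from the comment above.

```shell
# Sketch: move the existing account data aside, then register a new account.
# BACKUP_ROOT and REGISTER_CMD are hypothetical knobs for this illustration.
ACCOUNTS_DIR="${ACCOUNTS_DIR:-/etc/dehydrated/accounts}"
BACKUP_ROOT="${BACKUP_ROOT:-/etc/dehydrated/accounts-backup}"
REGISTER_CMD="${REGISTER_CMD:-sudo -u dehydrated dehydrated --register --accept-terms}"

recreate_account() {
    local backup entry
    backup="$BACKUP_ROOT/$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$backup" || return 1
    # Move every account entry away so that dehydrated creates a fresh one.
    for entry in "$ACCOUNTS_DIR"/*; do
        [ -e "$entry" ] || continue   # the glob matched nothing
        mv "$entry" "$backup"/ || return 1
    done
    # Word splitting is intentional: REGISTER_CMD holds a whole command line.
    $REGISTER_CMD
}
```

Calling recreate_account as root would then correspond to the two manual steps; the old account stays available in the backup directory.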

Actions #11

Updated by okurz almost 3 years ago

So do you have some idea how to recreate the account automatically in case something similar happens again? Since we have not been using dehydrated for that long and this problem has already shown up, I suggest looking into this.

Actions #12

Updated by mkittler almost 3 years ago

The alert is off again. We could try to automate the steps I did in case we get a 500 error, but I'm not sure whether it generally makes sense or how to implement it. We'd need to read the HTTP status code in some "post fail" script.
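
Reading the status code could look roughly like the following sketch. It assumes that dehydrated keeps printing the "(Status NNN)" suffix seen in the full error message earlier in this ticket; the function names are hypothetical.

```shell
# Print the HTTP status code found in dehydrated's "ERROR: ... (Status NNN)"
# line, or nothing if the output read from stdin contains no such line.
extract_http_status() {
    sed -n 's/.*ERROR:.*(Status \([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Succeed only if the output read from stdin reports an HTTP 500 error.
is_server_error() {
    [ "$(extract_http_status)" = "500" ]
}

# Possible usage (not executed here):
#   output=$(sudo -u dehydrated dehydrated --cron 2>&1)
#   printf '%s\n' "$output" | is_server_error && echo "CA returned 500"
```

Parsing human-readable output is fragile, so this would break silently if the message format changed.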

Actions #13

Updated by okurz almost 3 years ago

Do we need a wrapper shell script for dehydrated, with something like this?

dehydrated ...
for i in {1..7}; do
    curl -sS https://openqa.suse.de && exit 0
    sudo rm -rf /etc/dehydrated/accounts
    sudo -u dehydrated dehydrated --register --accept-terms
    sudo systemctl restart dehydrated
done
Actions #14

Updated by mkittler almost 3 years ago

Not sure whether we should forcefully delete all accounts every time we call dehydrated. And restarting the service from itself also doesn't seem to be the best idea. We could add a Restart=… on systemd-level and delete all accounts via ExecStopPost=… if $SERVICE_RESULT is failure.
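
A minimal sketch of the check such an ExecStopPost= helper might perform. systemd exports $SERVICE_RESULT to ExecStopPost= commands, and the journal above shows the value 'exit-code' for the failed run; the function name and the idea of wiring it into a drop-in are assumptions.

```shell
# Hypothetical helper for an ExecStopPost= command of dehydrated.service.
# SERVICE_RESULT is set by systemd; "success" means the run ended cleanly.
should_reset_account() {
    case "${SERVICE_RESULT:-}" in
        ""|success) return 1 ;;  # not started by systemd, or the run succeeded
        *) return 0 ;;           # any failure result, e.g. "exit-code"
    esac
}

# Possible usage inside the helper script (not executed here):
#   should_reset_account && echo "renewal failed, account reset needed"
```

This only decides whether a reset is warranted; it does not tell a broken account apart from any other failure, such as a CA outage.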

Actions #15

Updated by okurz almost 3 years ago

mkittler wrote:

Not sure whether we should forcefully delete all accounts every time we call dehydrated.

I did not suggest that. I would do that only after the certificate renew fails.

And restarting the service from itself also doesn't seem to be the best idea. We could add a Restart=… on systemd-level and delete all accounts via ExecStopPost=… if $SERVICE_RESULT is failure.

Well, that could be an option, but we should favor relying less on systemd and instead have something that we could also call from a container, a cloud instance or a bootstrap script.

Actions #16

Updated by mkittler almost 3 years ago

ok

I've also just done the same on the monitoring host as on OSD; it had the same problem.

Actions #17

Updated by jbaier_cz almost 3 years ago

If I understood correctly, our certificates are renewed right now. So I propose to wait for the next renewal round in two weeks and see.

I am not a salt expert, so I might be asking the wrong question: is it possible to have a different salt task (probably a "state" in salt terminology) which will delete the dehydrated data and then the standard default state which will again install it and obtain the certificates?

Actions #18

Updated by livdywan almost 3 years ago

  • Subject changed from Dehydrated fails on OSD to Dehydrated fails on OSD size:M
  • Description updated (diff)
Actions #19

Updated by jbaier_cz almost 3 years ago

  • Description updated (diff)
  • Assignee deleted (jbaier_cz)

I am unassigning myself as the scope has changed, and adding a suggestion about a failure hook.

Actions #20

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Workable
Actions #23

Updated by nicksinger almost 3 years ago

  • Status changed from Workable to Resolved

jbaier_cz wrote:

is it possible to have a different salt task (probably a "state" in salt terminology) which will delete the dehydrated data and then the standard default state which will again install it and obtain the certificates?

Yes, you can do this by adding a "require" statement to the dehydrated state pointing to another state (e.g. cleanup). But this would renew the account each and every time. That is pretty unclean and would introduce yet another couple of requests which could fail ;)

okurz wrote:

So do you have some idea how to recreate the account automatically in case something similar happens again? Since we have not been using dehydrated for that long and this problem has already shown up, I suggest looking into this.

I have been using dehydrated with the Let's Encrypt CA for several years now on my private machines and have never had such issues. I think an automatic retry here is not helpful and would rather mask issues of our internal CA. If this happens again, we should raise these issues with infra and explain our expectations for this service to them. Only if they clearly state that this can happen at any time should we invest time and effort in automating the account creation.

Actions #24

Updated by okurz almost 3 years ago

  • Assignee set to nicksinger
Actions #25

Updated by okurz almost 3 years ago

  • Assignee changed from nicksinger to mkittler