action #105960: Dehydrated fails on OSD size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #105960

closed

Dehydrated fails on OSD size:M

Added by mkittler almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-02-04
Due date:
% Done:
Estimated time:
Tags: alerts

Description

Observation¶

Feb 04 10:55:19 openqa systemd[1]: Starting Certificate Update Runner for Dehydrated...
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Using main config file /etc/dehydrated/config
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Feb 04 10:55:19 openqa dehydrated[6052]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Feb 04 10:55:19 openqa sudo[6052]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Feb 04 10:55:20 openqa dehydrated[6427]: {}
Feb 04 10:55:20 openqa systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 10:55:20 openqa systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Feb 04 10:55:20 openqa systemd[1]: Failed to start Certificate Update Runner for Dehydrated.

Acceptance criteria¶

AC1: Dehydrated runs automatically on OSD on a regular schedule

Suggestions¶

Add a simple systemd timer
Use service.running as a prerequisite state in salt, cf. https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/certificates/dehydrated.sls#L36
See if a new account is needed
Consider request_failure() hook inside https://github.com/dehydrated-io/dehydrated/blob/master/docs/examples/hook.sh
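
The request_failure() suggestion could look roughly like the sketch below. It assumes the argument order of the upstream example hook.sh (status code, reason, request type, headers) and that dehydrated is configured to call this script as its hook. FAILURE_MARKER and its default path are hypothetical and only serve as an illustration.

```shell
#!/bin/sh
# Sketch of a dehydrated hook that records server-side (5xx) ACME failures.
# FAILURE_MARKER is a hypothetical path, not something dehydrated provides.
FAILURE_MARKER="${FAILURE_MARKER:-/var/lib/dehydrated/last-request-failure}"

request_failure() {
    # Assumed argument order: status code, reason, request type, headers.
    # The reason ($2) and headers ($4) are ignored here; the CA in this
    # ticket answered with an empty "{}" body anyway.
    local statuscode="${1}" reqtype="${3}"
    case "${statuscode}" in
        5??)
            printf '%s %s\n' "${statuscode}" "${reqtype}" > "${FAILURE_MARKER}"
            ;;
    esac
}

# dehydrated calls the hook as: <hook> <handler> <args...>
# Only the request_failure handler is dispatched; all others are ignored.
case "${1:-}" in
    request_failure)
        shift
        request_failure "$@"
        ;;
esac
```

A monitoring check or a follow-up script could then look at the marker file to tell a CA-side failure apart from a local problem.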

Out of scope¶

qem dashboard (different ticket)

Related issues 1 (0 open — 1 closed)

Updated by mkittler almost 3 years ago

Assignee set to mkittler

This is the full error message:

martchus@openqa:~> sudo -u dehydrated dehydrated --cron
# INFO: Using main config file /etc/dehydrated/config
# INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Processing openqa.suse.de with alternative names: openqa.nue.suse.com
 + Checking domain name(s) of existing cert... unchanged.
 + Checking expire date of existing cert...
 + Valid till Feb 26 23:09:44 2022 GMT (Less than 23 days). Renewing!
 + Signing domains...
 + Generating signing request...
 + Requesting new certificate order from CA...
  + ERROR: An error occurred while sending post-request to https://ca-internal.suse.de/acme/acme/new-order (Status 500)

Details:
HTTP/2 500 
cache-control: no-store
content-type: application/json
link: <https://ca-internal.suse.de/acme/acme/directory>;rel="index"
replay-nonce: ZWRLTmM1VTBvUEpBZnM4QkN5UVNyUEZYYk5MOWc4OUc
content-length: 3
date: Fri, 04 Feb 2022 15:48:23 GMT

{}


/usr/bin/dehydrated: line 737: 1: unbound variable

Updated by mkittler almost 3 years ago

Assignee deleted (~~mkittler~~)

Let's talk about / estimate this ticket together next week.

Updated by okurz almost 3 years ago

I guess a retry in a systemd service override would suffice.

Updated by jbaier_cz almost 3 years ago

I see the same error on qam2, so it seems something is broken on the CA side:

Feb 07 01:52:32 qam2 systemd[1]: Starting Certificate Update Runner for Dehydrated...
Feb 07 01:52:32 qam2 dehydrated[25822]: # INFO: Using main config file /etc/dehydrated/config
Feb 07 01:52:32 qam2 dehydrated[25822]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Feb 07 01:52:32 qam2 sudo[25822]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Feb 07 01:52:35 qam2 dehydrated[26142]: HTTP/2 500
Feb 07 01:52:35 qam2 dehydrated[26142]: cache-control: no-store
Feb 07 01:52:35 qam2 dehydrated[26142]: content-type: application/json
Feb 07 01:52:35 qam2 dehydrated[26142]: link: <https://ca-internal.suse.de/acme/acme/directory>;rel="index"
Feb 07 01:52:35 qam2 dehydrated[26142]: replay-nonce: RDc3am9aSmZoMDBpaHdPamo3WVp4U0RweXpnU3F3UU4
Feb 07 01:52:35 qam2 dehydrated[26142]: content-length: 3
Feb 07 01:52:35 qam2 dehydrated[26142]: date: Mon, 07 Feb 2022 00:52:35 GMT
Feb 07 01:52:35 qam2 dehydrated[26142]: 
Feb 07 01:52:35 qam2 dehydrated[26143]: {}
Feb 07 01:52:35 qam2 systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Feb 07 01:52:35 qam2 systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Feb 07 01:52:35 qam2 systemd[1]: Failed to start Certificate Update Runner for Dehydrated.

The service itself is started from a timer (at least on qam2) with OnCalendar=daily, so there is no need to add a retry to the service (which would just fail again).

Updated by okurz almost 3 years ago

Has duplicate action #106035: [qe-tools] dehydrated service fails on osd added

Updated by jbaier_cz almost 3 years ago

Status changed from New to Feedback
Assignee set to jbaier_cz

Reported as SD-76058

Updated by jbaier_cz almost 3 years ago

Based on the infra investigation and my own assumptions, a package upgrade on the STEP-CA side did something to its internal account database. This can be repaired by "recreating the account"; I found that invoking dehydrated --account is sufficient. After that, the renewal works. Apart from that, infra has no idea what went wrong (the server-side logs are apparently limited).

On qam2, I ran

dehydrated --account
systemctl start dehydrated.service

and it solved the problem.

#10

Updated by mkittler almost 3 years ago

I moved the old account under /etc/dehydrated/accounts to a backup directory and created a new one via sudo -u dehydrated dehydrated --register --accept-terms. It works again now, and the certificate has indeed been renewed. (Without "disabling" the old account, dehydrated would not create a new one. The new account apparently also has the same hash/checksum/ID as the old account.)
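
The manual recovery described here can be sketched as two small shell functions. This is only an illustration: the function names and BACKUP_ROOT are hypothetical, while the accounts path and the dehydrated invocation are the ones mentioned in this comment.

```shell
# ACCOUNTS_DIR is the path named in this ticket; BACKUP_ROOT is hypothetical.
ACCOUNTS_DIR="${ACCOUNTS_DIR:-/etc/dehydrated/accounts}"
BACKUP_ROOT="${BACKUP_ROOT:-/etc/dehydrated/accounts-backup}"

# Move every existing account aside instead of deleting it, so it can be
# restored if re-registering does not help. ACCOUNTS_DIR itself is kept.
backup_accounts() {
    local target entry
    [ -d "${ACCOUNTS_DIR}" ] || return 0   # nothing to back up
    # A timestamped directory keeps backups from separate runs apart.
    target="${BACKUP_ROOT}/$(date +%Y%m%dT%H%M%S)"
    mkdir -p "${target}" || return 1
    for entry in "${ACCOUNTS_DIR}"/*; do
        [ -e "${entry}" ] || continue      # glob did not match: no accounts
        mv "${entry}" "${target}/" || return 1
    done
}

# Back up the old account data, then register again with the same command
# that was used manually on OSD.
recreate_account() {
    backup_accounts || return 1
    sudo -u dehydrated dehydrated --register --accept-terms
}
```

Moving rather than deleting mirrors what was done by hand and keeps the old key material available for later inspection.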

#11

Updated by okurz almost 3 years ago

So do you have an idea how to recreate the account automatically in case something similar happens again? Since we have not been using dehydrated for long and this problem has already shown up, I suggest looking into this.

#12

Updated by mkittler almost 3 years ago

The alert is off again. We could try to automate the steps I took in case we get a 500 error, but I'm not sure whether that generally makes sense or how to implement it. We would need to read the HTTP status code in some "post fail" script.
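
Reading the status code in such a script could be as simple as parsing dehydrated's error line. The sketch below assumes the "(Status NNN)" format seen in the error output quoted earlier in this ticket; both function names are hypothetical.

```shell
# Print the first HTTP status code found in dehydrated's output on stdin,
# or nothing if no "(Status NNN)" marker is present.
extract_http_status() {
    sed -n 's/.*(Status \([0-9][0-9][0-9]\)).*/\1/p' | head -n 1
}

# Succeed only if the output on stdin reports a server-side (5xx) error.
is_server_error() {
    case "$(extract_http_status)" in
        5??) return 0 ;;
        *)   return 1 ;;
    esac
}
```

With this, piping the combined output of a failed run into is_server_error should distinguish the CA-side failure seen here from other errors, provided the message format stays the same.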

#13

Updated by okurz almost 3 years ago

Do we need a wrapper shell script for dehydrated with something like this?

dehydrated ...
for i in {1..7}; do
    curl -sS https://openqa.suse.de && exit 0
    sudo rm -rf /etc/dehydrated/accounts
    sudo -u dehydrated dehydrated --register --accept-terms
    sudo systemctl restart dehydrated
done

#14

Updated by mkittler almost 3 years ago

Not sure whether we should forcefully delete all accounts every time we call dehydrated. And restarting the service from itself also doesn't seem to be the best idea. We could add a Restart=… on systemd-level and delete all accounts via ExecStopPost=… if $SERVICE_RESULT is failure.
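
A script called from ExecStopPost=… could decide whether to reset the accounts roughly as follows. systemd exports $SERVICE_RESULT to such commands; it is "success" after a clean run, and the failed runs in this ticket report 'exit-code'. The function names are hypothetical, and the sketch only reports the decision instead of touching any account data.

```shell
# Meant to be called from an ExecStopPost= script (assumption), where
# systemd sets SERVICE_RESULT, e.g. to "success" or "exit-code".
should_reset_accounts() {
    case "${SERVICE_RESULT:-}" in
        ""|success) return 1 ;;  # not run by systemd, or the run succeeded
        *)          return 0 ;;  # any failure result, such as "exit-code"
    esac
}

# Hypothetical entry point: only print the decision. The actual account
# reset is deliberately left out of this sketch.
report_decision() {
    if should_reset_accounts; then
        echo "renewal failed (${SERVICE_RESULT}): account reset would run"
    else
        echo "no account reset needed"
    fi
}
```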

#15

Updated by okurz almost 3 years ago

mkittler wrote:

Not sure whether we should forcefully delete all accounts every time we call dehydrated.

I did not suggest that. I would do that only after the certificate renew fails.

And restarting the service from itself also doesn't seem to be the best idea. We could add a Restart=… on systemd-level and delete all accounts via ExecStopPost=… if $SERVICE_RESULT is failure.

Well, that could be an option, but we should favor relying less on systemd and instead have something that we could also call from a container, a cloud instance or a bootstrap script.
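
A wrapper that does not depend on systemd might look like the sketch below. It resets the account only after a renewal attempt has failed and gives up after a fixed number of attempts. RENEW_CMD, RESET_CMD and MAX_ATTEMPTS are hypothetical knobs. RESET_CMD has no default because the reset has to move the old account aside before registering again, as noted in comment #10.

```shell
# Overridable so the wrapper can run from a container or bootstrap script.
RENEW_CMD="${RENEW_CMD:-dehydrated --cron}"
RESET_CMD="${RESET_CMD:-}"          # must be provided by the caller
MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"

renew_with_reset() {
    local attempt=1
    [ -n "${RESET_CMD}" ] || { echo "RESET_CMD is not set" >&2; return 2; }
    while [ "${attempt}" -le "${MAX_ATTEMPTS}" ]; do
        # Word splitting of the command strings is intentional here.
        if ${RENEW_CMD}; then
            return 0
        fi
        # Do not reset after the final attempt; there is no retry left.
        [ "${attempt}" -lt "${MAX_ATTEMPTS}" ] || break
        echo "renewal attempt ${attempt} failed, resetting account" >&2
        ${RESET_CMD} || return 1
        attempt=$((attempt + 1))
    done
    return 1
}
```

Unlike the loop proposed in #13, this never deletes anything itself and never restarts a service from within that service. Both actions are left to whatever RESET_CMD points to.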

#16

Updated by mkittler almost 3 years ago

I have just done the same on the monitoring host, which had the same problem as OSD.

#17

Updated by jbaier_cz almost 3 years ago

If I understood correctly, our certificates are renewed right now. So I propose to wait for the next renewal round in two weeks and see.

I am not a salt expert, so I might be asking the wrong question: is it possible to have a different salt task (probably a "state" in salt terminology) which will delete the dehydrated data and then the standard default state which will again install it and obtain the certificates?

#18

Updated by livdywan almost 3 years ago

Subject changed from Dehydrated fails on OSD to Dehydrated fails on OSD size:M
Description updated (diff)

#19

Updated by jbaier_cz almost 3 years ago

Description updated (diff)
Assignee deleted (~~jbaier_cz~~)

I am unassigning myself as the scope has changed. I have also added a suggestion about a failure hook.

#20

Updated by okurz almost 3 years ago

Status changed from Feedback to Workable

#23

Updated by nicksinger almost 3 years ago

Status changed from Workable to Resolved

jbaier_cz wrote:

is it possible to have a different salt task (probably a "state" in salt terminology) which will delete the dehydrated data and then the standard default state which will again install it and obtain the certificates?

Yes, you can do this by adding a "require" requisite to the dehydrated state pointing to another state (e.g. cleanup). But this would recreate the account each and every time. That is pretty unclean and would introduce yet another couple of requests which could fail ;)

okurz wrote:

So do you have an idea how to recreate the account automatically in case something similar happens again? Since we have not been using dehydrated for long and this problem has already shown up, I suggest looking into this.

I have been using dehydrated with the Let's Encrypt CA for several years now on my private machines and have never had such issues. I think an automatic retry is not helpful here and would rather mask issues with our internal CA. If this happens again, we should raise these issues with infra and explain our expectations for this service to them. Only if they clearly state that this can happen at any time should we invest time and effort in automating the account creation.

#24

Updated by okurz almost 3 years ago

Assignee set to nicksinger

#25

Updated by okurz almost 3 years ago

Assignee changed from nicksinger to mkittler
