action #169078
closed
dashboard.qam.suse.de SSL certificate not deployed within expiry size:S
Added by livdywan 7 months ago.
Updated 6 months ago.
Category: Regressions/Crashes
Description
Observation
The certificate for dashboard.qam.suse.de expired on 2024-10-30.
The problem had basically been resolved by the time I started to investigate it. Most likely dehydrated did not restart nginx after renewing the certificate 2 weeks earlier, as we have seen before.
Oct 17 00:11:15 qam2 dehydrated[28476]: # INFO: Using main config file /etc/dehydrat>
Oct 17 00:11:16 qam2 dehydrated[28476]: Processing qam2.suse.de with alternative nam>
Oct 17 00:11:16 qam2 dehydrated[28476]: + Checking domain name(s) of existing cert.>
Oct 17 00:11:16 qam2 dehydrated[28476]: + Checking expire date of existing cert...
Oct 17 00:11:16 qam2 dehydrated[28476]: + Valid till Oct 29 23:30:31 2024 GMT (Less>
Oct 17 00:11:16 qam2 dehydrated[28476]: + Signing domains...
Oct 17 00:11:16 qam2 dehydrated[28476]: + Generating private key...
Oct 17 00:11:17 qam2 dehydrated[28476]: + Generating signing request...
Oct 17 00:11:17 qam2 dehydrated[28476]: + Requesting new certificate order from CA.>
Oct 17 00:11:17 qam2 dehydrated[28476]: + Received 4 authorizations URLs from the CA
Oct 17 00:11:18 qam2 dehydrated[28476]: + Handling authorization for qam2.suse.de
Oct 17 00:11:18 qam2 dehydrated[28476]: + Handling authorization for qam2.qe.prg2.s>
Oct 17 00:11:18 qam2 dehydrated[28476]: + Handling authorization for qam.suse.de
Oct 17 00:11:18 qam2 dehydrated[28476]: + Handling authorization for dashboard.qam.>
Oct 17 00:11:18 qam2 dehydrated[28476]: + 4 pending challenge(s)
Oct 17 00:11:18 qam2 dehydrated[28476]: + Deploying challenge tokens...
Oct 17 00:11:18 qam2 dehydrated[28476]: + Responding to challenge for qam2.suse.de >
Oct 17 00:11:19 qam2 dehydrated[28476]: + Challenge is valid!
Oct 17 00:11:19 qam2 dehydrated[28476]: + Responding to challenge for qam2.qe.prg2.>
Oct 17 00:11:19 qam2 dehydrated[28476]: + Challenge is valid!
Oct 17 00:11:19 qam2 dehydrated[28476]: + Responding to challenge for qam.suse.de a>
Oct 17 00:11:19 qam2 dehydrated[28476]: + Challenge is valid!
Oct 17 00:11:19 qam2 dehydrated[28476]: + Responding to challenge for dashboard.qam>
Oct 17 00:11:19 qam2 dehydrated[28476]: + Challenge is valid!
Oct 17 00:11:19 qam2 dehydrated[28476]: + Cleaning challenge tokens...
Oct 17 00:11:19 qam2 dehydrated[28476]: + Requesting certificate...
Oct 17 00:11:19 qam2 dehydrated[28476]: + Checking certificate...
Oct 17 00:11:19 qam2 dehydrated[28476]: + Done!
Oct 17 00:11:19 qam2 dehydrated[28476]: + Creating fullchain.pem...
Oct 17 00:11:20 qam2 dehydrated[28476]: + Done!
Acceptance criteria
- AC1: An updated certificate is used by dashboard.qam.suse.de before the old one expires
- AC2: NGINX is using the updated certificate, e.g. is reloaded as needed
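One way to check AC2 by hand is to compare the certificate nginx actually serves with the one on disk. The following is only a sketch: the function names are made up for illustration, and the path of the renewed certificate on qam2 is an assumption that has to be adjusted to the local dehydrated setup.

```shell
# Sketch only; function names are hypothetical helpers, not part of dehydrated.

# Print the SHA-256 fingerprint of the first PEM certificate read from stdin.
cert_fingerprint() {
    openssl x509 -noout -fingerprint -sha256
}

# Print the fingerprint of the certificate the server presents on port 443.
served_fingerprint() {
    local host="$1"
    openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
        | cert_fingerprint
}

# Succeed only if the served certificate is the same as the first
# certificate in the given PEM file (for a fullchain file, the leaf).
served_matches_file() {
    local host="$1" file="$2" served on_disk
    served=$(served_fingerprint "$host")
    on_disk=$(cert_fingerprint <"$file")
    [ -n "$served" ] && [ "$served" = "$on_disk" ]
}
```

Usage could look like `served_matches_file dashboard.qam.suse.de /path/to/fullchain.pem || echo "nginx still serves an old certificate"`, where the file path is a placeholder.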
Suggestions
- Copied from action #165434: OSD SSL certificates not always refreshed within expected time, probably only after system reboots size:S added
- Tags changed from infra, ssl, osd, grafana, dehydrated, alert, qem-dashboard to infra, ssl, osd, grafana, dehydrated, qem-dashboard
- Description updated (diff)
- Subject changed from dashboard.qam.suse.de SSL certificate not deployed within expiry to dashboard.qam.suse.de SSL certificate not deployed within expiry size:S
- Description updated (diff)
- Status changed from New to Workable
- Copied to action #169357: monitoring+alerting for dashboard.qam.suse.de SSL certificate not deployed within expiry size:S added
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
- Due date set to 2024-11-21
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Workable
So far I've checked, and it does not seem to be the same problem as in #165434: the dehydrated-postrun-hooks.service exists but has never run, assuming the journal was not cleared.
sudo journalctl -u dehydrated-postrun-hooks.service
-- No entries --
The deploy_cert() function in /etc/dehydrated/hook.sh contains the line systemctl reload nginx, though it is for some reason commented out. Maybe this was used previously?
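For reference, this is roughly what the hook function would look like with the reload line active. It is a minimal sketch based on the argument list dehydrated passes to deploy_cert, not a copy of the actual /etc/dehydrated/hook.sh on qam2; the rest of that file (other hook functions and the dispatcher) is left out.

```shell
# Sketch of a dehydrated hook function, not the real hook.sh from qam2.
# dehydrated calls deploy_cert with these positional arguments after a
# certificate has been issued or renewed.
deploy_cert() {
    local DOMAIN="$1" KEYFILE="$2" CERTFILE="$3" FULLCHAINFILE="$4" CHAINFILE="$5" TIMESTAMP="$6"

    # Reload instead of restart: nginx re-reads the renewed certificate
    # from disk without dropping existing connections.
    systemctl reload nginx
}
```

If the reload is already handled elsewhere (for example by a postrun hook), enabling it here as well would just reload nginx twice per renewal.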
For now I'm setting this back to Workable until I'm done with #169555.
Next I will have a look at the ansible playbook to hopefully find out how it was previously done there.
- Status changed from Workable to In Progress
I had a look at these dashboard handlers, dehydrated handlers, and this 'Reload NginX' handler (from qamweb) in the qamops repo. I'm not that familiar with ansible, but if I understand correctly there are no triggers to reload nginx after the certificates are renewed, so I assume these were also not the methods used previously. @jbaier_cz or @kraih, do you know how the nginx reload/restart was previously triggered? In the worst case I can just uncomment the reload line in dehydrated's deploy_cert() function, but I'd like to know what the intended way is and what went wrong.
AFAIK the package dehydrated-nginx is installed, which provides /etc/dehydrated/postrun-hooks.d/reload-nginx.sh, and that was supposed to handle the reload. Almost everything should be default configuration, so the ansible role mentioned earlier just does the installation and sets some URLs/e-mails for the internal ACME server. No fancy stuff there.
- Status changed from In Progress to Feedback
OK, thanks for explaining.
If that was the default, it's weird that the service is neither running nor enabled, and that the journal has no entries either.
For now I will just start the postrun-hook service, and if nginx reloads successfully after the next dehydrated run I'll enable it permanently.
- Status changed from Feedback to Workable
- Due date changed from 2024-11-21 to 2024-11-29
Extraordinary due date hackweek bump.
- Status changed from Workable to Feedback
Running the postrun hook service manually did reload nginx, so the latest certificate is now in use. I have now enabled the service, assuming this will resolve the problem, but I still haven't figured out why it was disabled in the first place.
Should I still keep the ticket open until the next dehydrated run to make sure everything works?
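The steps described here can be sketched as follows. The unit name is the one from this ticket; the function names are only illustrative, and the commands need root on the affected host.

```shell
# Sketch only; wrapper function names are hypothetical.

# Enable the postrun hook unit persistently and print its enablement state.
enable_postrun_hooks() {
    systemctl enable dehydrated-postrun-hooks.service &&
        systemctl is-enabled dehydrated-postrun-hooks.service
}

# Show journal entries of the postrun hook unit since the given time,
# e.g. the date of the last dehydrated run, to confirm that it actually ran.
show_postrun_hook_log() {
    journalctl -u dehydrated-postrun-hooks.service --since "$1" --no-pager
}
```

After the next renewal, `show_postrun_hook_log yesterday` should no longer print "-- No entries --" if the hook was triggered.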
- Status changed from Feedback to Resolved
robert.richardson wrote in #note-15:
Running the postrun hook service manually did reload nginx, so the latest certificate is now in use. I have now enabled the service, assuming this will resolve the problem, but I still haven't figured out why it was disabled in the first place.
Should I still keep the ticket open until the next dehydrated run to make sure everything works?
If all questions have been answered the ticket is resolved (we used to keep tickets in Feedback but in practice it makes them unactionable). Re-open if it comes back.
- Status changed from Resolved to Workable
- Priority changed from Normal to High
I did not see any reference to ansible changes or the like to make sure this is persistent, as we discussed.
- Status changed from Workable to Resolved
- Related to action #179149: Cron <root@ariel> /usr/bin/dehydrated --cron | /opt/os-autoinst-scripts/filter-dehydrated-cron-output "ERROR: An error occurred while sending post-request to https://acme-v02.api.letsencrypt.org/… (Status 503)" size:S added