action #169078

closed

dashboard.qam.suse.de SSL certificate not deployed within expiry size:S

Added by livdywan about 2 months ago. Updated 20 days ago.

Status: Resolved
Priority: High
Category: Regressions/Crashes
Start date:
Due date: 2024-11-29
% Done: 0%
Estimated time:

Description

Observation

The certificate for dashboard.qam.suse.de expired on 2024-10-30.

By the time I started investigating, the problem had already been resolved. Most likely nginx was not reloaded after dehydrated renewed the certificate two weeks earlier, as we have seen before.

Oct 17 00:11:15 qam2 dehydrated[28476]: # INFO: Using main config file /etc/dehydrat>
Oct 17 00:11:16 qam2 dehydrated[28476]: Processing qam2.suse.de with alternative nam>
Oct 17 00:11:16 qam2 dehydrated[28476]:  + Checking domain name(s) of existing cert.>
Oct 17 00:11:16 qam2 dehydrated[28476]:  + Checking expire date of existing cert...  
Oct 17 00:11:16 qam2 dehydrated[28476]:  + Valid till Oct 29 23:30:31 2024 GMT (Less>
Oct 17 00:11:16 qam2 dehydrated[28476]:  + Signing domains...                        
Oct 17 00:11:16 qam2 dehydrated[28476]:  + Generating private key...                 
Oct 17 00:11:17 qam2 dehydrated[28476]:  + Generating signing request...             
Oct 17 00:11:17 qam2 dehydrated[28476]:  + Requesting new certificate order from CA.>
Oct 17 00:11:17 qam2 dehydrated[28476]:  + Received 4 authorizations URLs from the CA
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Handling authorization for qam2.suse.de   
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Handling authorization for qam2.qe.prg2.s>
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Handling authorization for qam.suse.de    
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Handling authorization for dashboard.qam.>
Oct 17 00:11:18 qam2 dehydrated[28476]:  + 4 pending challenge(s)                    
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Deploying challenge tokens...             
Oct 17 00:11:18 qam2 dehydrated[28476]:  + Responding to challenge for qam2.suse.de >
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Challenge is valid!                       
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Responding to challenge for qam2.qe.prg2.>
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Challenge is valid!                       
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Responding to challenge for qam.suse.de a>
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Challenge is valid!                       
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Responding to challenge for dashboard.qam>
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Challenge is valid!                       
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Cleaning challenge tokens...              
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Requesting certificate...                 
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Checking certificate...                   
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Done!                                     
Oct 17 00:11:19 qam2 dehydrated[28476]:  + Creating fullchain.pem...                 
Oct 17 00:11:20 qam2 dehydrated[28476]:  + Done!
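
For reference, the time between this renewal run and the expiry of the old certificate can be worked out from the timestamps in the log above. This is a minimal Python sketch, assuming the journal timestamps are in UTC (the log does not say, so the result may be off by an hour or two). The function name is made up for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_between(renewed_at: datetime, not_after: str) -> float:
    """Days from the renewal run until the old certificate's notAfter time."""
    # ssl.cert_time_to_seconds parses the "Oct 29 23:30:31 2024 GMT" format
    # and returns seconds since the epoch (UTC).
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - renewed_at).total_seconds() / 86400


# Renewal run from the log (assumed to be UTC) and the "Valid till" value
# that dehydrated reported for the old certificate.
renewed = datetime(2024, 10, 17, 0, 11, 16, tzinfo=timezone.utc)
window = days_between(renewed, "Oct 29 23:30:31 2024 GMT")
print(f"{window:.2f} days between renewal and expiry of the old certificate")
```

Under that assumption the renewed certificate was on disk for just under 13 days before the old one expired, so a single nginx reload in that window would have avoided the outage.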

Acceptance criteria

  • AC1: An updated certificate is used by dashboard.qam.suse.de before the old one expires
  • AC2: NGINX is using the updated certificate, e.g. is reloaded as needed

Suggestions
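
One possible way to check AC2 is to compare the certificate nginx currently serves with the leaf certificate in the fullchain.pem that dehydrated wrote. The following is a rough Python sketch using only the standard library, not an existing tool. All function names are hypothetical, and the path to fullchain.pem has to be supplied by the caller because the log above truncates it.

```python
import hashlib
import socket
import ssl

PEM_FOOTER = "-----END CERTIFICATE-----"


def leaf_fingerprint_from_bundle(pem_bundle: str) -> str:
    """SHA-256 fingerprint of the first (leaf) certificate in a PEM bundle."""
    head, sep, _ = pem_bundle.partition(PEM_FOOTER)
    if not sep:
        raise ValueError("no certificate found in bundle")
    der = ssl.PEM_cert_to_DER_cert(head.strip() + "\n" + PEM_FOOTER + "\n")
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """SHA-256 fingerprint of the certificate the server currently presents."""
    # Verification is disabled on purpose so that an already expired
    # certificate can still be fetched and compared.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def nginx_serves_renewed_cert(host: str, fullchain_path: str) -> bool:
    """True if the served certificate equals the leaf certificate on disk."""
    with open(fullchain_path, encoding="ascii") as bundle:
        on_disk = leaf_fingerprint_from_bundle(bundle.read())
    return served_fingerprint(host) == on_disk
```

Run on the host itself, a call such as nginx_serves_renewed_cert("dashboard.qam.suse.de", path_to_fullchain) returning False after a renewal would indicate that nginx has not been reloaded yet.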


Related issues 2 (1 open, 1 closed)

Copied from openQA Infrastructure (public) - action #165434: OSD SSL certificates not always refreshed within expected time, probably only after system reboots size:S (Resolved, nicksinger, 2024-08-18)
Copied to openQA Infrastructure (public) - action #169357: monitoring+alerting for dashboard.qam.suse.de SSL certificate not deployed within expiry size:S (Workable, 2024-11-05)

Actions #1

Updated by livdywan about 2 months ago

  • Copied from action #165434: OSD SSL certificates not always refreshed within expected time, probably only after system reboots size:S added
Actions #2

Updated by livdywan about 2 months ago

  • Tags changed from infra, ssl, osd, grafana, dehydrated, alert, qem-dashboard to infra, ssl, osd, grafana, dehydrated, qem-dashboard
  • Description updated (diff)
Actions #3

Updated by mkittler about 1 month ago

  • Subject changed from dashboard.qam.suse.de SSL certificate not deployed within expiry to dashboard.qam.suse.de SSL certificate not deployed within expiry size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz about 1 month ago

  • Copied to action #169357: monitoring+alerting for dashboard.qam.suse.de SSL certificate not deployed within expiry size:S added
Actions #5

Updated by robert.richardson about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #6

Updated by openqa_review about 1 month ago

  • Due date set to 2024-11-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by robert.richardson about 1 month ago

  • Status changed from In Progress to Workable

From what I've checked so far, this does not seem to be the same problem as in #165434: dehydrated-postrun-hooks.service exists, but it has apparently never run (unless the journal was cleared).

sudo journalctl -u dehydrated-postrun-hooks.service
-- No entries --

The deploy_cert() function in /etc/dehydrated/hook.sh contains the line systemctl reload nginx, but for some reason it is commented out. Maybe this was used previously?

For now I'm setting this back to Workable until I'm done with #169555.
Next I will have a look at the Ansible playbook to hopefully find out how it was previously done there.

Actions #8

Updated by robert.richardson about 1 month ago

  • Status changed from Workable to In Progress
Actions #9

Updated by robert.richardson about 1 month ago · Edited

I had a look at these dashboard handlers, dehydrated handlers, and this 'Reload NginX' handler (from qamweb) in the qamops repo. I'm not that familiar with Ansible, but if I understand correctly there are no triggers to reload nginx after the certificates are renewed, so I assume these were also not the methods used previously. @jbaier_cz or @kraih, do you know how the nginx reload/restart was previously triggered? In the worst case I can just uncomment the reload line in dehydrated's deploy_cert() function, but I'd like to know what the intended way is and what went wrong.

Actions #10

Updated by jbaier_cz about 1 month ago

Those reload handlers in Ansible are for reloading Nginx after a qem-dashboard deployment; that has absolutely nothing to do with dehydrated. There is a role to install dehydrated on the server (https://gitlab.suse.de/qa-maintenance/qamops/-/tree/master/ansible/roles/dehydrated?ref_type=heads), but the certificate renewal process itself is outside of Ansible.

Actions #11

Updated by jbaier_cz about 1 month ago

AFAIK the package dehydrated-nginx is installed, which provides /etc/dehydrated/postrun-hooks.d/reload-nginx.sh, and that was supposed to handle the reload. Almost everything should be default configuration, so the Ansible role mentioned earlier just does the installation and sets some URLs/e-mails for the internal ACME server. No fancy stuff there.

Actions #12

Updated by robert.richardson about 1 month ago

  • Status changed from In Progress to Feedback

OK, thanks for explaining.
If that was the default, it's weird that the service is neither running nor enabled, and that the journal has no entries for it.
For now I will just start the postrun-hook service, and if nginx reloads successfully after the next dehydrated run I'll enable it permanently.
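
The enabled state mentioned above can also be checked from a script. This is a small Python sketch that wraps the standard systemctl is-enabled command. The helper name and its injectable run parameter (there only so the helper can be exercised without systemd) are assumptions for illustration, not part of any existing tooling.

```python
import subprocess


def unit_enabled_state(unit: str, run=subprocess.run) -> str:
    """Return what `systemctl is-enabled UNIT` prints, e.g. "enabled" or "disabled"."""
    # check=False: is-enabled exits non-zero for disabled units,
    # which is an expected answer here rather than an error.
    result = run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


# Example call on the affected host:
# unit_enabled_state("dehydrated-postrun-hooks.service")
```

A result other than "enabled" for dehydrated-postrun-hooks.service would match the state described in this comment.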

Actions #13

Updated by okurz about 1 month ago

  • Status changed from Feedback to Workable
Actions #14

Updated by okurz about 1 month ago

  • Due date changed from 2024-11-21 to 2024-11-29

Extraordinary due date hackweek bump.

Actions #15

Updated by robert.richardson 22 days ago

  • Status changed from Workable to Feedback

Running the postrun hook service manually did reload nginx, so the latest certificate is now in use. I have now enabled the service, assuming this will resolve the problem. I still haven't figured out why it was disabled in the first place.

Should I still keep the ticket open until the next dehydrated run to make sure everything works?

Actions #16

Updated by livdywan 22 days ago

  • Status changed from Feedback to Resolved

robert.richardson wrote in #note-15:

Running the postrun hook service manually did reload nginx, so the latest certificate is now in use. I have now enabled the service, assuming this will resolve the problem. I still haven't figured out why it was disabled in the first place.

Should I still keep the ticket open until the next dehydrated run to make sure everything works?

If all questions have been answered the ticket is resolved (we used to keep tickets in Feedback but in practice it makes them unactionable). Re-open if it comes back.

Actions #17

Updated by okurz 21 days ago

  • Status changed from Resolved to Workable
  • Priority changed from Normal to High

I did not see any reference to Ansible changes or the like to make sure this is persistent, as we discussed.

Actions #18

Updated by robert.richardson 20 days ago · Edited

  • Status changed from Workable to Resolved

Had a short chat with @jbaier_cz, who helped me enable the service permanently in Ansible. The MR is merged.
