Project

General

Profile

Actions

action #139145

closed

dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M

Added by okurz 7 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-11-06
Due date:
% Done:

0%

Estimated time:

Description

Observation

Failed grafana alert about "Failed systemd services".

root@monitor:~ # systemctl status dehydrated
. dehydrated.service - Certificate Update Runner for Dehydrated
     Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
     Active: failed (Result: exit-code) since Mon 2023-11-06 00:21:21 CET; 9h ago
TriggeredBy: . dehydrated.timer
   Main PID: 19901 (code=exited, status=1/FAILURE)

Nov 06 00:19:10 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using main config file /etc/dehydrated/config
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 06 00:19:11 monitor sudo[19901]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 06 00:21:21 monitor dehydrated[19997]: EXPECTED value GOT EOF
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 06 00:21:21 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.

as of 2023-11-13 the service is not in state "failed" anymore so apparently a spurious sporadic problem.

Suggestions

  • Upstream research for error messages
  • Consider to change systemd service to restart automatically on failure

Rollback actions

  • Remove silence for "failed systemd services"
Actions #1

Updated by okurz 7 months ago

  • Priority changed from Normal to High
Actions #2

Updated by okurz 7 months ago

  • Subject changed from dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" to dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by jbaier_cz 7 months ago

  • Assignee set to jbaier_cz
Actions #4

Updated by jbaier_cz 7 months ago

  • Status changed from Workable to In Progress

The error looks to me like upstream downtime, the service restart could help with that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1042

Actions #5

Updated by openqa_review 7 months ago

  • Due date set to 2023-11-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by jbaier_cz 7 months ago

The change is deployed and unfortunately we can also already see it is working as the original issue is here:

Nov 16 14:10:20 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using main config file /etc/dehydrated/config
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 16 14:10:20 monitor sudo[26383]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 16 14:10:39 monitor dehydrated[27616]: EXPECTED value GOT EOF
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 16 14:10:39 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
Nov 16 14:11:39 monitor systemd[1]: dehydrated.service: Scheduled restart job, restart counter is at 1719.

Running the dehydrated manually, I am able to obtain a little bit more info. The error is during the curl call:

ERROR: Problem connecting to server (post for https://ca-internal.suse.de/acme/acme/challenge/<REDACTED>/<REDACTED>; curl returned with 92)
EXPECTED value GOT EOF
Actions #7

Updated by jbaier_cz 7 months ago

And by using http/1.1 the final error is OpenSSL SSL_read: OpenSSL/1.1.1l-fips: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac, errno 0, whatever that means. It looks like either a network error or product bug, I am filling SD ticket in any case as I am not probably not able to debug this without access to the other end.

Actions #8

Updated by jbaier_cz 7 months ago

  • Status changed from In Progress to Blocked

Well, this looks exactly like https://sd.suse.com/servicedesk/customer/portal/1/SD-110017 which is already closed (with quite unknown results) so I created a new ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-139253, maybe it will be finally solved.

Actions #9

Updated by jbaier_cz 7 months ago

No reaction in the SD ticket yet, still blocked.

Actions #10

Updated by tinita 7 months ago

  • Due date changed from 2023-11-28 to 2023-12-05

Still no answer in the ticket, but I see https://monitor.qa.suse.de/ has a certificate from 2023-11-23.

# systemctl status dehydrated
○ dehydrated.service - Certificate Update Runner for Dehydrated
     Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
    Drop-In: /etc/systemd/system/dehydrated.service.d
             └─override.conf
     Active: inactive (dead) since Thu 2023-11-30 00:12:33 CET; 11h ago
TriggeredBy: ● dehydrated.timer
   Main PID: 24301 (code=exited, status=0/SUCCESS)

Nov 30 00:12:30 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using main config file /etc/dehydrated/config
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 30 00:12:31 monitor dehydrated[24301]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 30 00:12:32 monitor sudo[24301]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 30 00:12:33 monitor systemd[1]: dehydrated.service: Deactivated successfully.
Nov 30 00:12:33 monitor systemd[1]: Finished Certificate Update Runner for Dehydrated.

So it seems we are good for now, bumping the due date.

Actions #11

Updated by tinita 6 months ago

  • Due date changed from 2023-12-05 to 2023-12-12
Actions #12

Updated by jbaier_cz 6 months ago · Edited

  • Due date changed from 2023-12-12 to 2023-12-23
  • Priority changed from High to Normal

So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.

Actions #13

Updated by livdywan 6 months ago

jbaier_cz wrote in #note-12:

So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.

Apparently the certificate was renewed? I'm seeing Monday, 15 January 2024 at 00:39:42 here. Also asked on the SD ticket.

Actions #14

Updated by jbaier_cz 6 months ago

Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.

Actions #15

Updated by livdywan 6 months ago

  • Due date changed from 2023-12-23 to 2024-01-11

jbaier_cz wrote in #note-14:

Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.

Yes... and it was renewed after the expiry you mentioned :-D So I guess we'll wait once more like before.

Actions #16

Updated by livdywan 5 months ago

  • Due date changed from 2024-01-11 to 2024-02-16

It seems like something or someone updated the certificate again. We don't know if this was fixed or chance, though. Let's give it more time as people are still on hols.

Actions #17

Updated by jbaier_cz 4 months ago

  • Due date deleted (2024-02-16)
  • Status changed from Blocked to Resolved

Had a brief discussion about the SD ticket with IT; at this point there is probably not much to do and we will likely need to wait if the issue reappears. The most likely cause was a network issue related to dct-migration.

Actions

Also available in: Atom PDF