action #139145
closed
dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M
Added by okurz about 1 year ago.
Updated 11 months ago.
Description
Observation¶
Failed grafana alert about "Failed systemd services".
root@monitor:~ # systemctl status dehydrated
. dehydrated.service - Certificate Update Runner for Dehydrated
Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
Active: failed (Result: exit-code) since Mon 2023-11-06 00:21:21 CET; 9h ago
TriggeredBy: . dehydrated.timer
Main PID: 19901 (code=exited, status=1/FAILURE)
Nov 06 00:19:10 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using main config file /etc/dehydrated/config
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 06 00:19:11 monitor sudo[19901]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 06 00:21:21 monitor dehydrated[19997]: EXPECTED value GOT EOF
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 06 00:21:21 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
as of 2023-11-13 the service is not in state "failed" anymore so apparently a spurious sporadic problem.
Suggestions¶
- Upstream research for error messages
- Consider to change systemd service to restart automatically on failure
Rollback actions¶
- Remove silence for "failed systemd services"
- Priority changed from Normal to High
- Subject changed from dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" to dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
- Due date set to 2023-11-28
Setting due date based on mean cycle time of SUSE QE Tools
The change is deployed and unfortunately we can also already see it is working as the original issue is here:
Nov 16 14:10:20 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using main config file /etc/dehydrated/config
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 16 14:10:20 monitor sudo[26383]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 16 14:10:39 monitor dehydrated[27616]: EXPECTED value GOT EOF
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 16 14:10:39 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
Nov 16 14:11:39 monitor systemd[1]: dehydrated.service: Scheduled restart job, restart counter is at 1719.
Running the dehydrated manually, I am able to obtain a little bit more info. The error is during the curl call:
ERROR: Problem connecting to server (post for https://ca-internal.suse.de/acme/acme/challenge/<REDACTED>/<REDACTED>; curl returned with 92)
EXPECTED value GOT EOF
And by using http/1.1 the final error is OpenSSL SSL_read: OpenSSL/1.1.1l-fips: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac, errno 0
, whatever that means. It looks like either a network error or product bug, I am filling SD ticket in any case as I am not probably not able to debug this without access to the other end.
- Status changed from In Progress to Blocked
No reaction in the SD ticket yet, still blocked.
- Due date changed from 2023-11-28 to 2023-12-05
Still no answer in the ticket, but I see https://monitor.qa.suse.de/ has a certificate from 2023-11-23.
# systemctl status dehydrated
○ dehydrated.service - Certificate Update Runner for Dehydrated
Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
Drop-In: /etc/systemd/system/dehydrated.service.d
└─override.conf
Active: inactive (dead) since Thu 2023-11-30 00:12:33 CET; 11h ago
TriggeredBy: ● dehydrated.timer
Main PID: 24301 (code=exited, status=0/SUCCESS)
Nov 30 00:12:30 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using main config file /etc/dehydrated/config
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 30 00:12:31 monitor dehydrated[24301]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 30 00:12:32 monitor sudo[24301]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 30 00:12:33 monitor systemd[1]: dehydrated.service: Deactivated successfully.
Nov 30 00:12:33 monitor systemd[1]: Finished Certificate Update Runner for Dehydrated.
So it seems we are good for now, bumping the due date.
- Due date changed from 2023-12-05 to 2023-12-12
- Due date changed from 2023-12-12 to 2023-12-23
- Priority changed from High to Normal
So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.
jbaier_cz wrote in #note-12:
So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.
Apparently the certificate was renewed? I'm seeing Monday, 15 January 2024 at 00:39:42 here. Also asked on the SD ticket.
Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.
- Due date changed from 2023-12-23 to 2024-01-11
jbaier_cz wrote in #note-14:
Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.
Yes... and it was renewed after the expiry you mentioned :-D So I guess we'll wait once more like before.
- Due date changed from 2024-01-11 to 2024-02-16
It seems like something or someone updated the certificate again. We don't know if this was fixed or chance, though. Let's give it more time as people are still on hols.
- Due date deleted (
2024-02-16)
- Status changed from Blocked to Resolved
Had a brief discussion about the SD ticket with IT; at this point there is probably not much to do and we will likely need to wait if the issue reappears. The most likely cause was a network issue related to dct-migration.
Also available in: Atom
PDF