action #139145
closeddehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M
0%
Description
Observation¶
Failed grafana alert about "Failed systemd services".
root@monitor:~ # systemctl status dehydrated
. dehydrated.service - Certificate Update Runner for Dehydrated
Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
Active: failed (Result: exit-code) since Mon 2023-11-06 00:21:21 CET; 9h ago
TriggeredBy: . dehydrated.timer
Main PID: 19901 (code=exited, status=1/FAILURE)
Nov 06 00:19:10 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using main config file /etc/dehydrated/config
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 06 00:19:10 monitor dehydrated[19901]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 06 00:19:11 monitor sudo[19901]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 06 00:21:21 monitor dehydrated[19997]: EXPECTED value GOT EOF
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 06 00:21:21 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 06 00:21:21 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
as of 2023-11-13 the service is not in state "failed" anymore so apparently a spurious sporadic problem.
Suggestions¶
- Upstream research for error messages
- Consider to change systemd service to restart automatically on failure
Rollback actions¶
- Remove silence for "failed systemd services"
Updated by okurz about 1 year ago
- Subject changed from dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" to dehydrated on monitor.qe.nue2.suse.org aka. monitor.qa.suse.de fails with "EXPECTED value GOT EOF" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz about 1 year ago
- Status changed from Workable to In Progress
The error looks to me like upstream downtime, the service restart could help with that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1042
Updated by openqa_review about 1 year ago
- Due date set to 2023-11-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 1 year ago
The change is deployed and unfortunately we can also already see it is working as the original issue is here:
Nov 16 14:10:20 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using main config file /etc/dehydrated/config
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 16 14:10:20 monitor dehydrated[26383]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 16 14:10:20 monitor sudo[26383]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 16 14:10:39 monitor dehydrated[27616]: EXPECTED value GOT EOF
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Nov 16 14:10:39 monitor systemd[1]: dehydrated.service: Failed with result 'exit-code'.
Nov 16 14:10:39 monitor systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
Nov 16 14:11:39 monitor systemd[1]: dehydrated.service: Scheduled restart job, restart counter is at 1719.
Running the dehydrated manually, I am able to obtain a little bit more info. The error is during the curl call:
ERROR: Problem connecting to server (post for https://ca-internal.suse.de/acme/acme/challenge/<REDACTED>/<REDACTED>; curl returned with 92)
EXPECTED value GOT EOF
Updated by jbaier_cz about 1 year ago
And by using http/1.1 the final error is OpenSSL SSL_read: OpenSSL/1.1.1l-fips: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac, errno 0
, whatever that means. It looks like either a network error or product bug, I am filling SD ticket in any case as I am not probably not able to debug this without access to the other end.
Updated by jbaier_cz about 1 year ago
- Status changed from In Progress to Blocked
Well, this looks exactly like https://sd.suse.com/servicedesk/customer/portal/1/SD-110017 which is already closed (with quite unknown results) so I created a new ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-139253, maybe it will be finally solved.
Updated by jbaier_cz about 1 year ago
No reaction in the SD ticket yet, still blocked.
Updated by tinita about 1 year ago
- Due date changed from 2023-11-28 to 2023-12-05
Still no answer in the ticket, but I see https://monitor.qa.suse.de/ has a certificate from 2023-11-23.
# systemctl status dehydrated
○ dehydrated.service - Certificate Update Runner for Dehydrated
Loaded: loaded (/usr/lib/systemd/system/dehydrated.service; static)
Drop-In: /etc/systemd/system/dehydrated.service.d
└─override.conf
Active: inactive (dead) since Thu 2023-11-30 00:12:33 CET; 11h ago
TriggeredBy: ● dehydrated.timer
Main PID: 24301 (code=exited, status=0/SUCCESS)
Nov 30 00:12:30 monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using main config file /etc/dehydrated/config
Nov 30 00:12:30 monitor dehydrated[24301]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Nov 30 00:12:31 monitor dehydrated[24301]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Nov 30 00:12:32 monitor sudo[24301]: root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Nov 30 00:12:33 monitor systemd[1]: dehydrated.service: Deactivated successfully.
Nov 30 00:12:33 monitor systemd[1]: Finished Certificate Update Runner for Dehydrated.
So it seems we are good for now, bumping the due date.
Updated by tinita about 1 year ago
- Due date changed from 2023-12-05 to 2023-12-12
Updated by jbaier_cz about 1 year ago · Edited
- Due date changed from 2023-12-12 to 2023-12-23
- Priority changed from High to Normal
So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.
Updated by livdywan about 1 year ago
jbaier_cz wrote in #note-12:
So far no reaction in the SD ticket, the current certificate is valid until Dec 30 23:32:34 2023 GMT. It seems that it is magically working for now and I am not expecting any reply soon, hence lowering the priority and bumping the due date before the expiration to check again.
Apparently the certificate was renewed? I'm seeing Monday, 15 January 2024 at 00:39:42 here. Also asked on the SD ticket.
Updated by jbaier_cz about 1 year ago
Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.
Updated by livdywan almost 1 year ago
- Due date changed from 2023-12-23 to 2024-01-11
jbaier_cz wrote in #note-14:
Yes, as I mentioned in the previous comment, the process is not working (without any known change on our side). That was also the reason to remove the urgency for this ticket.
Yes... and it was renewed after the expiry you mentioned :-D So I guess we'll wait once more like before.
Updated by jbaier_cz 11 months ago
- Due date deleted (
2024-02-16) - Status changed from Blocked to Resolved
Had a brief discussion about the SD ticket with IT; at this point there is probably not much to do and we will likely need to wait if the issue reappears. The most likely cause was a network issue related to dct-migration.