action #178147
closedcoordination #154768: [saga][epic][ux] State-of-art user experience for openQA
coordination #154771: [epic] Improved test developer user experience
ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S
0%
Description
Observation¶
We received several e-mail from logwarn about the following errors:
[2025-03-02T07:24:15.173127Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173191Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173257Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173325Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173392Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.884340Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884558Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884680Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884774Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884866Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884973Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.885071Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
Acceptance Criteria¶
- AC1: Only one error for any comment event not published via AMQP is visible in logs
- AC2: Errors can be traced to specific comments
Suggestions¶
- Look into implementation of comment posting in openQA / restart with comment features
- The attempt should really stop after 9 attempts to avoid alert fatigue
- Maybe this was many requests in parallel which all subsequently succeeded?
- Make the logs more verbose e.g. to see what comment was to be created, showing an ID or text or URL
- The delay between subsequent retry attempts should increase exponentially
Exponential backoff is a standard error handling strategy for network applications in which a client periodically retries a failed request with increasing delays between requests
Rollback Steps¶
- Remove ignoring
Publishing
from https://github.com/os-autoinst/openqa-logwarn (there are at least 3 PRs related to that, one very old, so removing the lines is easier than reverting the 3 PRs) - Close #105903 as well when done
Updated by gpathak 2 months ago
- Copied from action #168091: logwarn: "Publishing opensuse.openqa.job.done failed: SSL connect attempt failed" added
Updated by tinita 2 months ago
- Subject changed from ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" to ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan about 2 months ago
- Status changed from Feedback to Workable
- Priority changed from Normal to High
Unfortunately this is still resulting in multiple emails with many redundant error messages:
[2025-03-16T06:47:19.984452Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726555, job ID: 4926463, 9 attempts left)
[2025-03-16T06:47:19.984601Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726556, job ID: 4926756, 9 attempts left)
[2025-03-16T06:47:19.984691Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726557, job ID: 4926571, 9 attempts left)
[2025-03-16T06:47:19.984771Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726558, job ID: 4926560, 9 attempts left)
[2025-03-16T06:47:19.984847Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726559, job ID: 4926564, 9 attempts left)
[2025-03-16T06:47:19.984924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726560, job ID: 4926779, 9 attempts left)
[...]
[2025-03-16T06:47:21.036675Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726548, job ID: 4926327, 9 attempts left)
[2025-03-16T06:47:21.036758Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726546, job ID: 4926335, 9 attempts left)
[2025-03-16T06:47:21.036839Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726542, job ID: 4926835, 9 attempts left)
[2025-03-16T06:47:21.036924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
Updated by tinita about 2 months ago ยท Edited
But at least it shows now that it is about different job ids which were restarted together: https://openqa.opensuse.org/tests/4926463#comment-726555
The multiple emails for the same log lines are because of a logreport bug, otherwise we would have gotten only one email.
Updated by tinita about 2 months ago
Maybe we should only log an error if the last retry failed.
Updated by okurz about 2 months ago
tinita wrote in #note-9:
Maybe we should only log an error if the last retry failed.
I agree. As long as we retry we should not log with level "error", maybe "warn", maybe even only "info" depending on how those levels are handled by logwarn.
Updated by okurz about 2 months ago
- Priority changed from High to Urgent
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
Updated by livdywan about 2 months ago
- Status changed from Workable to In Progress
- Assignee changed from mkittler to livdywan
okurz wrote in #note-11:
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
I was just thinking the same. PR coming up shortly.
Updated by livdywan about 2 months ago
livdywan wrote in #note-12:
okurz wrote in #note-11:
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
I was just thinking the same. PR coming up shortly.
Updated by livdywan about 2 months ago
- Status changed from In Progress to Workable
- Assignee changed from livdywan to mkittler
- Priority changed from Urgent to High
Mitigation in place.
Updated by okurz about 2 months ago
- Priority changed from High to Urgent
livdywan wrote in #note-14:
Mitigation in place.
not effective. There were multiple emails to o3-admins@suse.de, e.g. 0810Z today with text "[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)"
Updated by tinita about 2 months ago
- Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added
Updated by tinita about 2 months ago
Mitigation: https://github.com/os-autoinst/openqa-logwarn/pull/57 Ignore RabbitMQ Publishing errors
This is basically the same as #105903 just a slightly different error message.
It will now ignore any event, not just comment.create
Updated by tinita about 2 months ago
- Priority changed from Urgent to High
Mitigation https://github.com/os-autoinst/openqa-logwarn/pull/57 merged
Updated by tinita about 2 months ago
[2025-03-19T08:14:51.686175Z] [error] [pid:15426] Publishing opensuse.openqa.job.done failed: Connect timeout (event ID: 4933738, job ID: none, 9 attempts left)
[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)
it looks like the event ID reported here is actually the job id
Updated by mkittler about 2 months ago
The job ID is shown as "job ID" correctly for comment.create
log messages. Maybe it would be less confusing to leave out the "job ID" part completely when not present.
I'll change the log message so an error is only logged when there are no attempts left.
Updated by mkittler about 2 months ago
- Status changed from Workable to Feedback
Updated by okurz about 2 months ago
- Status changed from Feedback to In Progress
https://github.com/os-autoinst/openQA/pull/6308 merged and deployed on o3. Now rollback actions plz
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
PR for rollback: https://github.com/os-autoinst/openqa-logwarn/pull/59
Updated by mkittler about 2 months ago
- Status changed from Feedback to Resolved
Done but keeping #105903 open as this was about a slightly different error message.
Updated by tinita about 2 months ago
- Status changed from Resolved to Workable
I mentioned #105903 in the rollback steps. And that we can possible remove all ignores related to Publishing
.
Why did you decide otherwise?
Updated by tinita about 2 months ago
- Status changed from Workable to In Progress
- Assignee changed from mkittler to tinita
Updated by tinita about 2 months ago
- Status changed from In Progress to Resolved
- Assignee changed from tinita to mkittler