action #178147
closedcoordination #154768: [saga][epic][ux] State-of-art user experience for openQA
coordination #154771: [epic] Improved test developer user experience
ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S
0%
Description
Observation¶
We received several e-mail from logwarn about the following errors:
[2025-03-02T07:24:15.173127Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173191Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173257Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173325Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173392Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.884340Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884558Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884680Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884774Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884866Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.884973Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(9 attempts left)
[2025-03-02T07:24:15.885071Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
Acceptance Criteria¶
- AC1: Only one error for any comment event not published via AMQP is visible in logs
- AC2: Errors can be traced to specific comments
Suggestions¶
- Look into implementation of comment posting in openQA / restart with comment features
- The attempt should really stop after 9 attempts to avoid alert fatigue
- Maybe this was many requests in parallel which all subsequently succeeded?
- Make the logs more verbose e.g. to see what comment was to be created, showing an ID or text or URL
- The delay between subsequent retry attempts should increase exponentially
> Exponential backoff is a standard error handling strategy for network applications in which a client periodically retries a failed request with increasing delays between requests
Rollback Steps¶
- Remove ignoring
Publishing
from https://github.com/os-autoinst/openqa-logwarn (there are at least 3 PRs related to that, one very old, so removing the lines is easier than reverting the 3 PRs) - Close #105903 as well when done
Updated by gpathak about 2 months ago
- Copied from action #168091: logwarn: "Publishing opensuse.openqa.job.done failed: SSL connect attempt failed" added
Updated by tinita about 1 month ago
- Subject changed from ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" to ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 1 month ago
- Description updated (diff)
- Status changed from In Progress to Feedback
Let's see whether those are about different comments/jobs: https://github.com/os-autoinst/openQA/pull/6292
Updated by livdywan about 1 month ago
- Status changed from Feedback to Workable
- Priority changed from Normal to High
Unfortunately this is still resulting in multiple emails with many redundant error messages:
[2025-03-16T06:47:19.984452Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726555, job ID: 4926463, 9 attempts left)
[2025-03-16T06:47:19.984601Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726556, job ID: 4926756, 9 attempts left)
[2025-03-16T06:47:19.984691Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726557, job ID: 4926571, 9 attempts left)
[2025-03-16T06:47:19.984771Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726558, job ID: 4926560, 9 attempts left)
[2025-03-16T06:47:19.984847Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726559, job ID: 4926564, 9 attempts left)
[2025-03-16T06:47:19.984924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726560, job ID: 4926779, 9 attempts left)
[...]
[2025-03-16T06:47:21.036675Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726548, job ID: 4926327, 9 attempts left)
[2025-03-16T06:47:21.036758Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726546, job ID: 4926335, 9 attempts left)
[2025-03-16T06:47:21.036839Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
(event ID: 726542, job ID: 4926835, 9 attempts left)
[2025-03-16T06:47:21.036924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
Updated by tinita about 1 month ago ยท Edited
But at least it shows now that it is about different job ids which were restarted together: https://openqa.opensuse.org/tests/4926463#comment-726555
The multiple emails for the same log lines are because of a logreport bug, otherwise we would have gotten only one email.
Updated by tinita about 1 month ago
Maybe we should only log an error if the last retry failed.
Updated by okurz about 1 month ago
tinita wrote in #note-9:
Maybe we should only log an error if the last retry failed.
I agree. As long as we retry we should not log with level "error", maybe "warn", maybe even only "info" depending on how those levels are handled by logwarn.
Updated by okurz about 1 month ago
- Priority changed from High to Urgent
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
Updated by livdywan about 1 month ago
- Status changed from Workable to In Progress
- Assignee changed from mkittler to livdywan
okurz wrote in #note-11:
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
I was just thinking the same. PR coming up shortly.
Updated by livdywan about 1 month ago
livdywan wrote in #note-12:
okurz wrote in #note-11:
multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/
I was just thinking the same. PR coming up shortly.
Updated by livdywan about 1 month ago
- Status changed from In Progress to Workable
- Assignee changed from livdywan to mkittler
- Priority changed from Urgent to High
Mitigation in place.
Updated by okurz about 1 month ago
- Priority changed from High to Urgent
livdywan wrote in #note-14:
Mitigation in place.
not effective. There were multiple emails to o3-admins@suse.de, e.g. 0810Z today with text "[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)"
Updated by tinita about 1 month ago
- Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added
Updated by tinita about 1 month ago
Mitigation: https://github.com/os-autoinst/openqa-logwarn/pull/57 Ignore RabbitMQ Publishing errors
This is basically the same as #105903 just a slightly different error message.
It will now ignore any event, not just comment.create
Updated by tinita about 1 month ago
- Priority changed from Urgent to High
Mitigation https://github.com/os-autoinst/openqa-logwarn/pull/57 merged
Updated by tinita about 1 month ago
[2025-03-19T08:14:51.686175Z] [error] [pid:15426] Publishing opensuse.openqa.job.done failed: Connect timeout (event ID: 4933738, job ID: none, 9 attempts left)
[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)
it looks like the event ID reported here is actually the job id
Updated by mkittler about 1 month ago
The job ID is shown as "job ID" correctly for comment.create
log messages. Maybe it would be less confusing to leave out the "job ID" part completely when not present.
I'll change the log message so an error is only logged when there are no attempts left.
Updated by mkittler about 1 month ago
- Status changed from Workable to Feedback
Updated by okurz about 1 month ago
- Status changed from Feedback to In Progress
https://github.com/os-autoinst/openQA/pull/6308 merged and deployed on o3. Now rollback actions plz
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
PR for rollback: https://github.com/os-autoinst/openqa-logwarn/pull/59
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
Done but keeping #105903 open as this was about a slightly different error message.
Updated by tinita about 1 month ago
- Status changed from Resolved to Workable
I mentioned #105903 in the rollback steps. And that we can possible remove all ignores related to Publishing
.
Why did you decide otherwise?
Updated by tinita about 1 month ago
- Status changed from Workable to In Progress
- Assignee changed from mkittler to tinita
Updated by tinita about 1 month ago
- Status changed from In Progress to Resolved
- Assignee changed from tinita to mkittler