Project

General

Profile

Actions

action #178147

closed

coordination #154768: [saga][epic][ux] State-of-art user experience for openQA

coordination #154771: [epic] Improved test developer user experience

ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S

Added by gpathak about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Observation

We received several e-mail from logwarn about the following errors:

[2025-03-02T07:24:15.173127Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173191Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173257Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173325Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.173392Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)
[2025-03-02T07:24:15.884340Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.884558Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.884680Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.884774Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.884866Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.884973Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (9 attempts left)
[2025-03-02T07:24:15.885071Z] [error] [pid:32103] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed

Acceptance Criteria

  • AC1: Only one error for any comment event not published via AMQP is visible in logs
  • AC2: Errors can be traced to specific comments

Suggestions

  • Look into implementation of comment posting in openQA / restart with comment features
  • The attempt should really stop after 9 attempts to avoid alert fatigue
    • Maybe this was many requests in parallel which all subsequently succeeded?
  • Make the logs more verbose e.g. to see what comment was to be created, showing an ID or text or URL
  • The delay between subsequent retry attempts should increase exponentially
    > Exponential backoff is a standard error handling strategy for network applications in which a client periodically retries a failed request with increasing delays between requests

Rollback Steps


Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left)Resolvedmkittler2022-02-03

Actions
Copied from openQA Project (public) - action #168091: logwarn: "Publishing opensuse.openqa.job.done failed: SSL connect attempt failed"Resolvedokurz2024-10-10

Actions
Actions #1

Updated by gpathak about 2 months ago

  • Copied from action #168091: logwarn: "Publishing opensuse.openqa.job.done failed: SSL connect attempt failed" added
Actions #2

Updated by gpathak about 2 months ago

  • Description updated (diff)
Actions #3

Updated by okurz about 2 months ago

  • Parent task set to #154771
Actions #4

Updated by tinita about 1 month ago

  • Subject changed from ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" to ariel.suse-dmz.opensuse.org: " Publishing opensuse.openqa.comment.create failed: Can't connect: System error (9 attempts left)" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler about 1 month ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

Let's see whether those are about different comments/jobs: https://github.com/os-autoinst/openQA/pull/6292

Actions #7

Updated by livdywan about 1 month ago

  • Status changed from Feedback to Workable
  • Priority changed from Normal to High

Unfortunately this is still resulting in multiple emails with many redundant error messages:

[2025-03-16T06:47:19.984452Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726555, job ID: 4926463, 9 attempts left)
[2025-03-16T06:47:19.984601Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726556, job ID: 4926756, 9 attempts left)
[2025-03-16T06:47:19.984691Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726557, job ID: 4926571, 9 attempts left)
[2025-03-16T06:47:19.984771Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726558, job ID: 4926560, 9 attempts left)
[2025-03-16T06:47:19.984847Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726559, job ID: 4926564, 9 attempts left)
[2025-03-16T06:47:19.984924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: Can't connect: System error (event ID: 726560, job ID: 4926779, 9 attempts left)
[...]
[2025-03-16T06:47:21.036675Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (event ID: 726548, job ID: 4926327, 9 attempts left)
[2025-03-16T06:47:21.036758Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (event ID: 726546, job ID: 4926335, 9 attempts left)
[2025-03-16T06:47:21.036839Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
 (event ID: 726542, job ID: 4926835, 9 attempts left)
[2025-03-16T06:47:21.036924Z] [error] [pid:4360] Publishing opensuse.openqa.comment.create failed: SSL connect attempt failed error:0A000086:SSL routines::certificate verify failed
Actions #8

Updated by tinita about 1 month ago ยท Edited

But at least it shows now that it is about different job ids which were restarted together: https://openqa.opensuse.org/tests/4926463#comment-726555

The multiple emails for the same log lines are because of a logreport bug, otherwise we would have gotten only one email.

Actions #9

Updated by tinita about 1 month ago

Maybe we should only log an error if the last retry failed.

Actions #10

Updated by okurz about 1 month ago

tinita wrote in #note-9:

Maybe we should only log an error if the last retry failed.

I agree. As long as we retry we should not log with level "error", maybe "warn", maybe even only "info" depending on how those levels are handled by logwarn.

Actions #11

Updated by okurz about 1 month ago

  • Priority changed from High to Urgent

multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/

Actions #12

Updated by livdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee changed from mkittler to livdywan

okurz wrote in #note-11:

multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/

I was just thinking the same. PR coming up shortly.

Actions #13

Updated by livdywan about 1 month ago

livdywan wrote in #note-12:

okurz wrote in #note-11:

multiple repeated alerts from today. please look into mitigations: https://github.com/os-autoinst/openqa-logwarn/

I was just thinking the same. PR coming up shortly.

https://github.com/os-autoinst/openqa-logwarn/pull/51

Actions #14

Updated by livdywan about 1 month ago

  • Status changed from In Progress to Workable
  • Assignee changed from livdywan to mkittler
  • Priority changed from Urgent to High
Actions #15

Updated by okurz about 1 month ago

  • Priority changed from High to Urgent

livdywan wrote in #note-14:

Mitigation in place.

not effective. There were multiple emails to o3-admins@suse.de, e.g. 0810Z today with text "[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)"

Actions #16

Updated by tinita about 1 month ago

  • Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added
Actions #17

Updated by tinita about 1 month ago

Mitigation: https://github.com/os-autoinst/openqa-logwarn/pull/57 Ignore RabbitMQ Publishing errors
This is basically the same as #105903 just a slightly different error message.

It will now ignore any event, not just comment.create

Actions #18

Updated by tinita about 1 month ago

  • Priority changed from Urgent to High
Actions #19

Updated by tinita about 1 month ago

  • Description updated (diff)
Actions #20

Updated by tinita about 1 month ago

[2025-03-19T08:14:51.686175Z] [error] [pid:15426] Publishing opensuse.openqa.job.done failed: Connect timeout (event ID: 4933738, job ID: none, 9 attempts left)
[2025-03-19T07:15:03.629005Z] [error] [pid:22325] Publishing opensuse.openqa.job.restart failed: Connect timeout (event ID: 4933429, job ID: none, 9 attempts left)

it looks like the event ID reported here is actually the job id

Actions #21

Updated by mkittler about 1 month ago

The job ID is shown as "job ID" correctly for comment.create log messages. Maybe it would be less confusing to leave out the "job ID" part completely when not present.

I'll change the log message so an error is only logged when there are no attempts left.

Actions #22

Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback
Actions #23

Updated by okurz about 1 month ago

  • Status changed from Feedback to In Progress

https://github.com/os-autoinst/openQA/pull/6308 merged and deployed on o3. Now rollback actions plz

Actions #24

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback
Actions #25

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

Done but keeping #105903 open as this was about a slightly different error message.

Actions #26

Updated by tinita about 1 month ago

  • Status changed from Resolved to Workable

I mentioned #105903 in the rollback steps. And that we can possible remove all ignores related to Publishing.
Why did you decide otherwise?

Actions #27

Updated by tinita about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee changed from mkittler to tinita
Actions #28

Updated by tinita about 1 month ago

  • Status changed from In Progress to Resolved
  • Assignee changed from tinita to mkittler
Actions

Also available in: Atom PDF