Project

General

Profile

action #72238

websocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)

Added by ggardet_arm 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-10-05
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose.
Example: https://openqa.opensuse.org/tests/1419808/file/worker-log.txt

[2020-10-05T10:00:06.0611 UTC] [info] 12022: WORKING 1419807
[2020-10-05T10:00:36.0678 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:00:46.0678 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:00:46.0838 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:00:46.0940 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:01:24.0164 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:01:34.0165 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:01:34.0302 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:01:34.0396 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:04.0442 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:02:14.0443 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:02:14.0595 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:02:14.0599 UTC] [error] Stopping because a critical error occurred.
[2020-10-05T10:02:14.0599 UTC] [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
[2020-10-05T10:02:14.0695 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:17.0709 UTC] [error] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[2020-10-05T10:02:17.0827 UTC] [info] Uploading video.ogv
[2020-10-05T10:02:18.0055 UTC] [info] Uploading vars.json
[2020-10-05T10:02:18.0169 UTC] [info] Uploading autoinst-log.txt
[2020-10-05T10:02:18.0304 UTC] [info] Uploading worker-log.txt

When it retries, it succeeds, but at some point, it seems to give up.

Do we reset the retry counter once a successful reconnection happen?

Acceptance criteria

  • AC1: Given a flaky connection, When some websocket connection tries fail but succeed, Then a subsequent connection fail triggers a retry (rather than immediate fail)

Suggestions

  • Check (and add test if missing) for the case of connection failing, then succeeding then failing again

History

#1 Updated by okurz 4 months ago

  • Subject changed from SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose to websocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)
  • Description updated (diff)
  • Category set to Feature requests
  • Status changed from New to Workable
  • Priority changed from Normal to Low
  • Target version set to Ready

#2 Updated by ggardet_arm 4 months ago

It looks like we do not even retry on some error, see: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L219

 # note: The worker is supposed to handle this event and e.g. try to re-register again.

Retry mechanism may not apply here? https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L261

#3 Updated by ggardet_arm 4 months ago

The journal is a bit more verbose on the failure, related to Mojo (cause or consequence?):

Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Registering with openQA https://openqa.opensuse.org                                                                                                                                       Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293                                                                                                                    Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98.                                            Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [error] Stopping because a critical error occurred.                                                                                                                                              Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.                                                                            Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293                                                                                    Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Isotovideo exit status: 0

#4 Updated by mkittler 3 months ago

We've also seen this error when running internal tests: https://progress.opensuse.org/issues/72196#note-4

So if you're using Mojolicious 8.60, please upgrade to 8.61.

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 3 months ago

  • Status changed from Workable to Feedback

#7 Updated by ggardet_arm 3 months ago

I just updated perl-Mojolicious from 8.60 to 8.61, using the package from the devel project. Let's see how it goes for the next runs.

#8 Updated by ggardet_arm 3 months ago

  • Status changed from Feedback to Resolved

Still ok after few days. Set to Resolved.

Also available in: Atom PDF