action #72238
closedwebsocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)
Description
Motivation¶
SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose.
Example: https://openqa.opensuse.org/tests/1419808/file/worker-log.txt
[2020-10-05T10:00:06.0611 UTC] [info] 12022: WORKING 1419807
[2020-10-05T10:00:36.0678 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:00:46.0678 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:00:46.0838 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:00:46.0940 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:01:24.0164 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:01:34.0165 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:01:34.0302 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:01:34.0396 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:04.0442 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:02:14.0443 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:02:14.0595 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:02:14.0599 UTC] [error] Stopping because a critical error occurred.
[2020-10-05T10:02:14.0599 UTC] [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
[2020-10-05T10:02:14.0695 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:17.0709 UTC] [error] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.
[2020-10-05T10:02:17.0827 UTC] [info] Uploading video.ogv
[2020-10-05T10:02:18.0055 UTC] [info] Uploading vars.json
[2020-10-05T10:02:18.0169 UTC] [info] Uploading autoinst-log.txt
[2020-10-05T10:02:18.0304 UTC] [info] Uploading worker-log.txt
When it retries, it succeeds, but at some point, it seems to give up.
Do we reset the retry counter once a successful reconnection happen?
Acceptance criteria¶
- AC1: Given a flaky connection, When some websocket connection tries fail but succeed, Then a subsequent connection fail triggers a retry (rather than immediate fail)
Suggestions¶
- Check (and add test if missing) for the case of connection failing, then succeeding then failing again
Updated by okurz about 4 years ago
- Subject changed from SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose to websocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)
- Description updated (diff)
- Category set to Feature requests
- Status changed from New to Workable
- Priority changed from Normal to Low
- Target version set to Ready
Updated by ggardet_arm about 4 years ago
It looks like we do not even retry on some error, see: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L219
# note: The worker is supposed to handle this event and e.g. try to re-register again.
Retry mechanism may not apply here? https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L261
Updated by ggardet_arm about 4 years ago
The journal is a bit more verbose on the failure, related to Mojo (cause or consequence?):
Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Registering with openQA https://openqa.opensuse.org Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293 Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: Mojo::Reactor::Poll: Timer failed: Can't call method "unsubscribe" on an undefined value at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 98. Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [error] Stopping because a critical error occurred. Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now. Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293 Oct 05 13:33:10 ip-10-0-0-58 worker[33305]: [info] Isotovideo exit status: 0
Updated by mkittler about 4 years ago
We've also seen this error when running internal tests: https://progress.opensuse.org/issues/72196#note-4
So if you're using Mojolicious 8.60, please upgrade to 8.61.
Updated by ggardet_arm about 4 years ago
I just updated perl-Mojolicious
from 8.60
to 8.61
, using the package from the devel project. Let's see how it goes for the next runs.
Updated by ggardet_arm about 4 years ago
- Status changed from Feedback to Resolved
Still ok after few days. Set to Resolved
.