Actions
action #72238
closedwebsocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)
Description
Motivation¶
SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose.
Example: https://openqa.opensuse.org/tests/1419808/file/worker-log.txt
[2020-10-05T10:00:06.0611 UTC] [info] 12022: WORKING 1419807
[2020-10-05T10:00:36.0678 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:00:46.0678 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:00:46.0838 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:00:46.0940 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:01:24.0164 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:01:34.0165 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:01:34.0302 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:01:34.0396 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:04.0442 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:02:14.0443 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:02:14.0595 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:02:14.0599 UTC] [error] Stopping because a critical error occurred.
[2020-10-05T10:02:14.0599 UTC] [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
[2020-10-05T10:02:14.0695 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:17.0709 UTC] [error] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.
[2020-10-05T10:02:17.0827 UTC] [info] Uploading video.ogv
[2020-10-05T10:02:18.0055 UTC] [info] Uploading vars.json
[2020-10-05T10:02:18.0169 UTC] [info] Uploading autoinst-log.txt
[2020-10-05T10:02:18.0304 UTC] [info] Uploading worker-log.txt
When it retries, it succeeds, but at some point, it seems to give up.
Do we reset the retry counter once a successful reconnection happen?
Acceptance criteria¶
- AC1: Given a flaky connection, When some websocket connection tries fail but succeed, Then a subsequent connection fail triggers a retry (rather than immediate fail)
Suggestions¶
- Check (and add test if missing) for the case of connection failing, then succeeding then failing again
Actions