action #72238

closed

websocket connection retry on flaky connections (was: SLE15-SP2 on AWS M6g (aarch64) machine fails to run a worker properly due to lots of Websocket connections lose)

Added by ggardet_arm about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2020-10-05
Due date:
% Done: 0%
Estimated time:
Description

Motivation

A worker on an SLE15-SP2 AWS M6g (aarch64) machine fails to run properly because its websocket connection is lost repeatedly.
Example: https://openqa.opensuse.org/tests/1419808/file/worker-log.txt

[2020-10-05T10:00:06.0611 UTC] [info] 12022: WORKING 1419807
[2020-10-05T10:00:36.0678 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:00:46.0678 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:00:46.0838 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:00:46.0940 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:01:24.0164 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:01:34.0165 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:01:34.0302 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:01:34.0396 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:04.0442 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2020-10-05T10:02:14.0443 UTC] [info] Registering with openQA https://openqa.opensuse.org
[2020-10-05T10:02:14.0595 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293
[2020-10-05T10:02:14.0599 UTC] [error] Stopping because a critical error occurred.
[2020-10-05T10:02:14.0599 UTC] [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now.
[2020-10-05T10:02:14.0695 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293
[2020-10-05T10:02:17.0709 UTC] [error] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31.

[2020-10-05T10:02:17.0827 UTC] [info] Uploading video.ogv
[2020-10-05T10:02:18.0055 UTC] [info] Uploading vars.json
[2020-10-05T10:02:18.0169 UTC] [info] Uploading autoinst-log.txt
[2020-10-05T10:02:18.0304 UTC] [info] Uploading worker-log.txt

Each retry succeeds, but at some point the worker seems to give up (see the "Stopping because a critical error occurred" entry at 10:02:14 above).

Do we reset the retry counter once a reconnection succeeds?

Acceptance criteria

  • AC1: Given a flaky connection, When some websocket connection attempts fail but a later attempt succeeds, Then a subsequent connection failure triggers a retry (rather than an immediate failure)

Suggestions

  • Check (and add a test if missing) the case of a connection failing, then succeeding, then failing again
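
The behaviour asked for in AC1 can be sketched as a small retry policy that counts only consecutive failures and clears the count on every successful connection. This is a minimal illustration in Python, not the openQA worker's actual Perl code; `ReconnectPolicy`, `replay`, and the `max_consecutive_failures` limit are hypothetical names and values chosen for the example.

```python
from typing import Iterable, List


class ReconnectPolicy:
    """Hypothetical retry policy: gives up only after too many failures in a row."""

    def __init__(self, max_consecutive_failures: int = 3) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def on_connected(self) -> None:
        # A successful (re)connection clears the failure history.
        self.consecutive_failures = 0

    def on_failure(self) -> str:
        # "retry" while the budget lasts, "give_up" once it is exceeded.
        self.consecutive_failures += 1
        if self.consecutive_failures > self.max_consecutive_failures:
            return "give_up"
        return "retry"


def replay(events: Iterable[str], policy: ReconnectPolicy) -> List[str]:
    """Feed a sequence of "ok"/"fail" events to the policy and collect its decisions."""
    decisions = []
    for event in events:
        if event == "ok":
            policy.on_connected()
            decisions.append("connected")
        else:
            decisions.append(policy.on_failure())
    return decisions


if __name__ == "__main__":
    # Fail, succeed, fail again: the pattern from the worker log in this ticket.
    flaky = ["fail", "ok", "fail", "ok", "fail", "ok", "fail"]
    print(replay(flaky, ReconnectPolicy(max_consecutive_failures=2)))
```

With the reset in `on_connected`, the flaky sequence never exhausts the budget even though it contains more failures in total than the limit; only an uninterrupted run of failures leads to "give_up". A test for the suggested case can assert exactly that on the list returned by `replay`.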