action #72238

Updated by okurz over 3 years ago

## Motivation 

 An SLE15-SP2 worker on an AWS M6g (aarch64) machine fails to run properly because its websocket connection to the openQA host is lost repeatedly. 
 Example: https://openqa.opensuse.org/tests/1419808/file/worker-log.txt 

 ``` 
 [2020-10-05T10:00:06.0611 UTC] [info] 12022: WORKING 1419807 
 [2020-10-05T10:00:36.0678 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds 
 [2020-10-05T10:00:46.0678 UTC] [info] Registering with openQA https://openqa.opensuse.org 
 [2020-10-05T10:00:46.0838 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293 
 [2020-10-05T10:00:46.0940 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293 
 [2020-10-05T10:01:24.0164 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds 
 [2020-10-05T10:01:34.0165 UTC] [info] Registering with openQA https://openqa.opensuse.org 
 [2020-10-05T10:01:34.0302 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293 
 [2020-10-05T10:01:34.0396 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293 
 [2020-10-05T10:02:04.0442 UTC] [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/293 finished by remote side with code 1006, no reason - trying again in 10 seconds 
 [2020-10-05T10:02:14.0443 UTC] [info] Registering with openQA https://openqa.opensuse.org 
 [2020-10-05T10:02:14.0595 UTC] [info] Establishing ws connection via wss://openqa.opensuse.org/api/v1/ws/293 
 [2020-10-05T10:02:14.0599 UTC] [error] Stopping because a critical error occurred. 
 [2020-10-05T10:02:14.0599 UTC] [error] Another error occurred when trying to stop gracefully due to an error. Trying to kill ourself forcefully now. 
 [2020-10-05T10:02:14.0695 UTC] [info] Registered and connected via websockets with openQA host https://openqa.opensuse.org and worker ID 293 
 [2020-10-05T10:02:17.0709 UTC] [error] Upload images subprocess error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/JSON.pm line 31. 

 [2020-10-05T10:02:17.0827 UTC] [info] Uploading video.ogv 
 [2020-10-05T10:02:18.0055 UTC] [info] Uploading vars.json 
 [2020-10-05T10:02:18.0169 UTC] [info] Uploading autoinst-log.txt 
 [2020-10-05T10:02:18.0304 UTC] [info] Uploading worker-log.txt 
 ``` 

 Each retry succeeds at first, but shortly after the third connection loss (10:02:04 in the log above) the worker seems to give up and stops with a critical error. 

 Do we reset the retry counter once a reconnection succeeds? 


 ## Acceptance criteria 
 * **AC1:** Given a flaky connection, When some websocket connection attempts fail but a later attempt succeeds, Then a subsequent connection failure triggers a retry (rather than an immediate failure) 


 ## Suggestions 
 * Check (and add a test if missing) the case where the connection fails, then succeeds, then fails again
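
To make the suggested scenario concrete, here is a minimal Python sketch of the behaviour AC1 asks for. It is not the openQA worker code: `ReconnectPolicy`, `max_retries`, `on_connected`, `on_connection_lost` and `simulate` are hypothetical names made up for illustration, and the assumption that the worker counts consecutive failures against a fixed limit is only a guess at the mechanism behind the question above.

```python
from dataclasses import dataclass


@dataclass
class ReconnectPolicy:
    """Hypothetical retry bookkeeping; not openQA's actual implementation."""

    max_retries: int = 3
    failures: int = 0  # consecutive failures since the last successful connection

    def on_connected(self) -> None:
        # A successful (re)connection clears the failure streak.
        self.failures = 0

    def on_connection_lost(self) -> bool:
        """Record one failure and return True if another retry is allowed."""
        self.failures += 1
        return self.failures <= self.max_retries


def simulate(policy, events):
    """Replay 'lost'/'connected' events and return the decision made for each 'lost'."""
    decisions = []
    for event in events:
        if event == "connected":
            policy.on_connected()
        elif event == "lost":
            allowed = policy.on_connection_lost()
            decisions.append("retry" if allowed else "give up")
        else:
            raise ValueError(f"unknown event: {event}")
    return decisions
```

A test for the fail, succeed, fail case would replay `["lost", "connected", "lost", "connected", "lost"]` with `max_retries=2` and expect a retry every time. Without the reset in `on_connected`, the third loss would exhaust the limit, which would match the pattern in the log if that is indeed what happens.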
