action #134924

Updated by mkittler 9 months ago

### Observation 
 When debugging OSD worker 40 after the VM migration, I noticed that many worker slots are shown as "broken" with the reason "graceful disconnect …". This looks weird. The worker slot's journal reveals that the worker is really just waiting for the websocket server to respond: 

 ``` 
 Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de 
 Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108 
 Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds 
 Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de 
 Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108 
 Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds 
 Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de 
 Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108 
 Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108 
 ``` 

 The worker first registers via the API and then establishes a websocket connection. Here we can see that establishing the websocket connection timed out after 5 minutes on the first two attempts (likely hitting the gateway timeout). It was then retried again and the websocket server was still quite slow, but at least the timeout wasn't exceeded anymore and the registration was eventually successful. 
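
 To make the timing in the journal excerpt easier to read, here is a minimal Python sketch that measures both phases of each connection attempt (API registration, then websocket upgrade). It only parses the log lines quoted above; the function names and `ASSUMED_YEAR` are made up for illustration and are not part of openQA.

```python
from datetime import datetime

# Assumption: short journal timestamps omit the year, so any fixed year works
# as long as all lines come from the same year.
ASSUMED_YEAR = 2000

JOURNAL = """\
Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108
"""


def timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' of a journal line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def connection_attempts(journal):
    """Return (registration_seconds, websocket_seconds, outcome) per attempt."""
    attempts = []
    registering = establishing = None
    for line in journal.splitlines():
        line = line.strip()
        if not line:
            continue
        if "Registering with openQA" in line:
            registering, establishing = timestamp(line), None
        elif "Establishing ws connection" in line:
            establishing = timestamp(line)
        elif registering and establishing:
            if "Unable to upgrade to ws connection" in line:
                outcome = "failed"
            elif "Registered and connected via websockets" in line:
                outcome = "connected"
            else:
                continue
            end = timestamp(line)
            attempts.append((
                int((establishing - registering).total_seconds()),
                int((end - establishing).total_seconds()),
                outcome,
            ))
            registering = establishing = None
    return attempts


for registration_s, websocket_s, outcome in connection_attempts(JOURNAL):
    print(f"registration {registration_s:>3}s, websocket {websocket_s:>3}s -> {outcome}")
```

 For the excerpt above this reports two failed attempts with a 300-second websocket phase each, followed by a successful attempt in which the registration step took 289 seconds.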

 The impact is not that high considering there's already an infinite retry and we don't get any incompletes due to this (as the worker isn't even able to pick up jobs anyways). I still think there's room for improvement (see ACs). 

 Note that the problem was likely this severe because OSD was generally quite unresponsive at the time. However, it has occurred before (just less severely and probably without hitting the gateway timeout). The display problem (AC2) in particular has confused me before. 

 ### Acceptance criteria 
 * **AC1**: The websocket server is able to handle high load (a high number of connected workers like we have on OSD) better. 
 * **AC2**: Workers that have been registered via the API but haven't established the websocket connection yet are shown more clearly as such in the workers table. For instance, the message shown when clicking on the "?" next to "broken" could state that the worker is waiting for the websocket server.
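
 As a rough illustration of AC2, the following Python sketch shows one way the hint text could be chosen. It is not openQA code: `WorkerState`, its fields and `status_hint` are hypothetical names, and the real workers table may track this state differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerState:
    # Hypothetical fields, chosen only for this sketch.
    registered_via_api: bool
    websocket_connected: bool
    error: Optional[str] = None  # e.g. a stale "graceful disconnect" reason


def status_hint(state: WorkerState) -> str:
    """Pick the text shown behind the "?" for a worker slot."""
    if not state.registered_via_api:
        return "not registered"
    if not state.websocket_connected:
        # Registered but still waiting: say so instead of showing a stale error.
        return "waiting for the websocket server"
    if state.error:
        return f"broken: {state.error}"
    return "connected"


print(status_hint(WorkerState(True, False, "graceful disconnect")))
```

 The point of the ordering is that a worker that is registered but not yet connected is reported as waiting, even if an older error reason is still stored for it.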
