action #134924
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #157669: websockets+scheduler improvements to support more online worker instances
Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table
Description
Observation
When debugging the OSD worker host worker40 after the VM migration, I noticed that many worker slots were shown as "broken" with the reason "graceful disconnect …". This looks odd. The worker slot's journal reveals that the worker is actually just waiting for the websocket server to respond:
Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108
The worker first registers via the API and then establishes a websocket connection. Here we can see that establishing the websocket connection timed out after 5 minutes (likely hitting the gateway timeout). It was then retried; the websocket server was still quite slow, but at least the timeout wasn't exceeded anymore and the registration was eventually successful.
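To make the flow above easier to follow, here is a minimal sketch of the register-then-connect loop with its fixed 10-second retry. This is only an illustration under assumptions: the real openQA worker is written in Perl, and the helper names register_via_api and upgrade_to_websocket are hypothetical placeholders, not actual openQA functions.

    import time

    RETRY_DELAY = 10  # seconds, matching the "trying again in 10 seconds" log line


    def register_via_api(host: str) -> int:
        """Register the worker slot via the REST API and return its worker ID.

        Placeholder for illustration only.
        """
        raise NotImplementedError


    def upgrade_to_websocket(host: str, worker_id: int) -> None:
        """Upgrade to ws://<host>/api/v1/ws/<worker_id>.

        Placeholder: assumed to raise ConnectionError when the reverse proxy
        answers 502, e.g. because the websocket server is too slow and the
        gateway timeout is exceeded.
        """
        raise NotImplementedError


    def connect(host: str) -> None:
        """Register, then keep retrying the websocket upgrade until it succeeds."""
        while True:
            worker_id = register_via_api(host)
            try:
                upgrade_to_websocket(host, worker_id)
                return  # registered and connected, as in the last log line above
            except ConnectionError:
                # Same behaviour as seen in the journal: wait, then start over
                # with a fresh API registration, indefinitely.
                time.sleep(RETRY_DELAY)

During this loop the worker is already registered via the API but not yet connected via websockets, which is exactly the window in which the workers table currently shows it as "broken".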
The impact is not that high considering there's already an infinite retry and we don't get any incompletes due to this (as the worker isn't even able to pick up jobs anyway). I still think there's room for improvement (see ACs).
Note that the severity of the problem was likely due to OSD being generally quite unresponsive at the time. However, this problem has been occurring before (just less severely and probably without hitting the gateway timeout). The display problem in particular (AC2) has confused me before.
Acceptance criteria
- AC1: The websocket server is able to handle high load (a high number of connected workers like we have on OSD) better.
- AC2: Workers that have been registered via the API but haven't established the websocket connection yet are shown more clearly as such in the workers table. For instance, the message shown when clicking on the "?" next to "broken" could state that the worker is waiting for the websocket server (see the sketch after this list).
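As a rough illustration of AC2, the sketch below shows one way the workers table could distinguish "registered, still waiting for the websocket server" from a genuinely broken slot. This is not openQA's actual code; the field names (registered, websocket_connected, error) and the label texts are assumptions for the example.

    from dataclasses import dataclass
    from typing import Optional, Tuple


    @dataclass
    class WorkerSlot:
        registered: bool              # registered via the REST API
        websocket_connected: bool     # websocket connection established
        error: Optional[str] = None   # e.g. "graceful disconnect …"


    def display_status(slot: WorkerSlot) -> Tuple[str, str]:
        """Return (label, tooltip) for the workers table."""
        if slot.registered and not slot.websocket_connected:
            # Instead of showing "broken" with a stale "graceful disconnect"
            # reason, make the intermediate state explicit.
            return ("connecting",
                    "Registered via API, waiting for the websocket server")
        if slot.error:
            return ("broken", slot.error)
        return ("online", "Connected via websockets")

With something along these lines the "?" tooltip would directly tell the viewer that the slot is merely waiting for the websocket server, rather than suggesting the worker itself is broken.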