action #134924 (open)

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #157669: websockets+scheduler improvements

Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table

Added by mkittler 8 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2023-08-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

When debugging OSD worker 40 after the VM migration, I noticed that many worker slots are shown as "broken" with the reason "graceful disconnect …". This looks weird. The worker slot's journal reveals that the worker is really just waiting for the websocket server to respond:

Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108

The worker first registers via the API and then establishes a websocket connection. Here we can see that establishing the websocket connection timed out after 5 minutes (likely hitting the gateway timeout). It was then retried; the websocket server was still quite slow, but at least the timeout wasn't exceeded anymore and the registration was eventually successful.
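
The retry loop visible in the journal can be summarized roughly as follows. This is a minimal Python sketch for illustration only; the helper names (register_via_api, upgrade_to_websocket) and their stub return values are assumptions, and the real worker is implemented in Perl:

    import time

    # Hypothetical helpers standing in for the worker's REST registration and
    # websocket upgrade; names and return values are illustrative, not openQA's API.
    def register_via_api(host: str) -> int:
        """Register the worker with the openQA host and return a worker ID (stubbed)."""
        return 3108

    def upgrade_to_websocket(host: str, worker_id: int) -> bool:
        """Attempt the upgrade to ws://<host>/api/v1/ws/<worker_id>; False on e.g. a 502."""
        return True

    def connect(host: str, retry_delay: int = 10) -> None:
        # Register first, then try the websocket upgrade; on failure (e.g. a
        # gateway timeout surfacing as a 502) wait and start over with a fresh
        # registration, retrying indefinitely.
        while True:
            worker_id = register_via_api(host)
            if upgrade_to_websocket(host, worker_id):
                print(f"Registered and connected via websockets with worker ID {worker_id}")
                return
            print(f"Unable to upgrade to ws connection - trying again in {retry_delay} seconds")
            time.sleep(retry_delay)

    if __name__ == "__main__":
        connect("openqa.suse.de")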

The impact is not that high considering there's already an infinite retry and we don't get any incompletes due to this (as the worker isn't even able to pick up jobs anyway). I still think there's room for improvement (see ACs).

Note that the severity of the problem was likely due to OSD being generally quite unresponsive at the time. However, this problem has been occurring before, just less severely and probably without hitting the gateway timeout. Especially the display problem (AC2) has confused me before.

Acceptance criteria

  • AC1: The websocket server is able to handle high load (a high number of connected workers like we have on OSD) better.
  • AC2: Workers that have been registered via the API but haven't established the websocket connection yet are shown more clearly as such in the workers table. For instance, the message shown when clicking on the "?" next to "broken" could state that the worker is waiting for the websocket server (see the sketch after this list).
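
To illustrate AC2, here is a minimal sketch of how the workers-table status text could distinguish the intermediate state. All field and function names are hypothetical and not openQA's actual data model:

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative only: derive the status text so that "registered but still
    # waiting for the websocket server" is no longer conflated with a genuinely
    # broken slot. Field names are assumptions, not openQA's schema.
    @dataclass
    class WorkerSlot:
        registered: bool                  # REST API registration completed
        websocket_connected: bool         # ws connection established
        error: Optional[str] = None       # e.g. "graceful disconnect ..."

    def status_text(slot: WorkerSlot) -> str:
        if slot.websocket_connected:
            return "online"
        if slot.registered:
            # AC2: make the intermediate state explicit instead of "broken"
            return "registered, waiting for the websocket server"
        return f"broken ({slot.error or 'unknown reason'})"

    print(status_text(WorkerSlot(registered=True, websocket_connected=False)))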

Related issues: 1 (0 open, 1 closed)

Related to openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert (Resolved, okurz, 2023-09-07)

Actions #1

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #2

Updated by livdywan 8 months ago

  • Subject changed from Websocket server overloaded, affected worker slots shown as "broken" with graful disconnect in workers table to Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table
Actions #3

Updated by okurz 8 months ago

  • Related to coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert added
Actions #4

Updated by okurz 8 months ago

This might be more severe than we think, although I assume the workaround would be to reduce the number of workers connected.

Actions #5

Updated by kraih 8 months ago

First small optimization PR removing one UPDATE query: https://github.com/os-autoinst/openQA/pull/5293
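
As a generic illustration of that kind of optimization (skipping a database write when nothing has changed), and explicitly not a description of what PR 5293 actually changes, here is a small Python/SQLite sketch with made-up table and column names:

    import sqlite3

    # Made-up schema purely for illustration; not openQA's database layout.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workers (id INTEGER PRIMARY KEY, status TEXT)")
    db.execute("INSERT INTO workers VALUES (3108, 'online')")

    def set_status(worker_id: int, status: str) -> None:
        row = db.execute("SELECT status FROM workers WHERE id = ?", (worker_id,)).fetchone()
        if row and row[0] == status:
            return  # unchanged: skip the UPDATE instead of issuing it on every message
        db.execute("UPDATE workers SET status = ? WHERE id = ?", (status, worker_id))

    set_status(3108, "online")   # no UPDATE issued
    set_status(3108, "working")  # UPDATE issued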

Actions #6

Updated by okurz about 1 month ago

  • Parent task set to #110833
Actions #7

Updated by okurz about 1 month ago

  • Parent task changed from #110833 to #157669
