action #135362

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Optimize worker status update handling in websocket server size:M

Added by kraih 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-09-07
Due date:
% Done: 0%
Estimated time:

Description

Motivation

#135122 has shown severe performance issues in the websocket server: while it is busy with database queries for worker status updates, the service can get blocked from assigning jobs.

Acceptance criteria

  • AC1: The hot code path for worker status updates is no longer a performance bottleneck.

Suggestions

  • Reduce the number of database queries.
  • Get rid of the worker number broadcast to workers, which was meant to help with this problem, but has now become a bottleneck itself.
  • Make sure multiple worker status messages from the same worker don't clog the websocket buffer.
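
To illustrate the first and last suggestions, here is a minimal sketch in Python (not openQA code, and not necessarily how this ticket was resolved). It assumes that only the newest status message per worker matters, so older unprocessed messages from the same worker can be dropped, and that pending updates can be written in one batch. The names `StatusCoalescer` and `apply_batch` are hypothetical, and the dictionary `store` only stands in for the database.

```python
from collections import OrderedDict


class StatusCoalescer:
    """Keep only the most recent unprocessed status message per worker."""

    def __init__(self):
        # worker_id -> latest status, ordered by each worker's first pending message
        self._pending = OrderedDict()

    def submit(self, worker_id, status):
        """Record a status update, replacing any unprocessed one from the same worker."""
        self._pending[worker_id] = status

    def drain(self):
        """Return all pending (worker_id, status) pairs and empty the buffer."""
        batch = list(self._pending.items())
        self._pending.clear()
        return batch


def apply_batch(batch, store):
    """Hypothetical stand-in for a single bulk write instead of one query per message."""
    store.update(batch)
    return len(batch)


coalescer = StatusCoalescer()
coalescer.submit(1, "idle")
coalescer.submit(2, "working")
coalescer.submit(1, "working")  # supersedes the earlier message from worker 1

store = {}
written = apply_batch(coalescer.drain(), store)
print(written, store)
```

With this approach a worker that sends many updates in a short time costs one buffered entry and one write per drain, rather than one query per message.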

Files

before.png (150 KB), kraih, 2023-09-12 11:17
after.png (148 KB), kraih, 2023-09-12 11:17