action #135362

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Optimize worker status update handling in websocket server size:M

Added by kraih 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-09-07
Due date:
% Done: 0%
Estimated time:
Difficulty: hard

Description

Motivation

#135122 has shown that there are very severe performance issues in the websocket server that can cause the service to get blocked from assigning jobs, because it is busy dealing with database queries for worker status updates.

Acceptance criteria

  • AC1: The hot code path for worker status updates is no longer a performance bottleneck.

Suggestions

  • Reduce the number of database queries.
  • Get rid of the worker number broadcast to workers, which was meant to help with this problem, but has now become a bottleneck itself.
  • Make sure multiple worker status messages from the same worker don't clog the websocket buffer.
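
One possible reading of the last suggestion is to coalesce pending status messages per worker, so that only the newest one is processed. The following is a minimal Python sketch of that idea under stated assumptions, not openQA's actual implementation (openQA is written in Perl); `StatusCoalescer` and its methods are hypothetical names introduced for illustration.

```python
from collections import OrderedDict


class StatusCoalescer:
    """Keep at most one pending status update per worker (hypothetical helper)."""

    def __init__(self):
        # worker_id -> latest status payload, ordered by first arrival
        self._pending = OrderedDict()

    def push(self, worker_id, status):
        # A newer message from the same worker overwrites the older one
        # instead of queueing behind it.
        self._pending[worker_id] = status

    def drain(self):
        # Hand out everything that is pending and reset the buffer.
        items = list(self._pending.items())
        self._pending.clear()
        return items


coalescer = StatusCoalescer()
coalescer.push(2218, "free")
coalescer.push(887, "working")
coalescer.push(2218, "working")  # replaces the earlier "free" for worker 2218
batch = coalescer.drain()
print(batch)
```

With this approach a burst of messages from one worker costs a single database update per drain, at the price of dropping intermediate states.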

Files

before.png (150 KB) kraih, 2023-09-12 11:17
after.png (148 KB) kraih, 2023-09-12 11:17
Actions #1

Updated by kraih 3 months ago

  • Status changed from New to In Progress
Actions #2

Updated by livdywan 3 months ago

  • Subject changed from Optimize worker status update handling in websocket server to Optimize worker status update handling in websocket server size:M
Actions #4

Updated by kraih 3 months ago

Second patch, for a more significant performance improvement: https://github.com/os-autoinst/openQA/pull/5294

Actions #5

Updated by openqa_review 3 months ago

  • Due date set to 2023-09-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by kraih 3 months ago

Third patch with another good performance boost: https://github.com/os-autoinst/openQA/pull/5297

Actions #7

Updated by tinita 3 months ago

kraih wrote in #note-6:

Third patch with another good performance boost: https://github.com/os-autoinst/openQA/pull/5297

Deployed Saturday morning on osd

Actions #8

Updated by kraih 3 months ago

tinita wrote in #note-7:

Deployed Saturday morning on osd

And this is what the status update intervals look like for one worker now, very random:

Sep 11 12:02:49 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:05:39 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:06:48 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:10:38 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:12:18 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:15:11 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:17:37 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:22:13 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:24:18 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:26:50 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:29:05 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:31:49 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:33:57 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:36:32 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:39:39 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:43:52 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:47:40 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:52:21 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 12:55:53 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:00:16 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:05:07 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:06:46 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:09:41 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:10:50 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:13:24 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:15:26 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:20:12 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:23:36 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:24:47 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:27:38 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:32:27 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:36:48 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:37:52 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:42:08 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:43:48 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:46:08 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
Sep 11 13:49:31 openqa openqa-websockets-daemon[29588]: [debug] [pid:29588] Updating seen of worker 2218 from worker_status (free)
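
To put numbers on how irregular these intervals are, the gaps between consecutive updates can be computed from the timestamps. This is a small illustrative Python sketch using the first five timestamps of the excerpt above; the helper name `gaps_in_seconds` is made up for this example.

```python
from datetime import datetime

# First five timestamps of worker 2218, copied from the log excerpt above.
timestamps = ["12:02:49", "12:05:39", "12:06:48", "12:10:38", "12:12:18"]


def gaps_in_seconds(stamps):
    """Return the gaps between consecutive HH:MM:SS timestamps of one day."""
    parsed = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(timestamps)
print(gaps)  # [170, 69, 230, 100]
```

For these five updates the gaps range from 69 to 230 seconds.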
Actions #9

Updated by kraih 3 months ago

Fourth patch, with a more surprising optimization: https://github.com/os-autoinst/openQA/pull/5299

Updated by kraih 3 months ago

To give an update on the current state of websocket server optimizations: I've been profiling benchmark runs with 1000 status updates from a worker with 1420 previous jobs in the database. For that case we started at 8.66s before patch 1 and are now at 3.70s with patch 6. Currently no new hot spots are identifiable in my profiling data, so I will shift focus to better logging to help us identify other areas that need improvement. Specific areas of interest are lock contention in the database and buffering in the websocket server implementation, both of which can be triggered by outside factors that a simple profiling benchmark won't show.
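
For reference, the benchmark figures above work out to roughly a 2.3x speedup. The short Python calculation below only restates the numbers from this note; the variable names are chosen for illustration.

```python
updates = 1000   # status updates per benchmark run
before_s = 8.66  # total runtime before patch 1, in seconds
after_s = 3.70   # total runtime with patch 6, in seconds

speedup = before_s / after_s
reduction = (before_s - after_s) / before_s
per_update_before_ms = before_s / updates * 1000
per_update_after_ms = after_s / updates * 1000

print(f"speedup: {speedup:.2f}x")            # speedup: 2.34x
print(f"runtime reduction: {reduction:.0%}")  # runtime reduction: 57%
print(f"per update: {per_update_before_ms:.2f} ms -> {per_update_after_ms:.2f} ms")
```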

Actions #12

Updated by kraih 3 months ago

Patch 7 with a new log message to help with identifying capacity issues: https://github.com/os-autoinst/openQA/pull/5303

Actions #13

Updated by livdywan 3 months ago

  • Due date changed from 2023-09-22 to 2023-10-06

Let's wait a bit to see how this looks

Actions #14

Updated by kraih 3 months ago

Hot patched the two latest patches into production on OSD and almost immediately got useful results:

Sep 13 15:16:40 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Started to send message to 887 for job(s) 12102468
Sep 13 15:16:40 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Worker 887 accepted job 12102468
...
Sep 13 15:17:11 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 887 from worker_status (working)
...
Sep 13 15:17:12 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 3236 from worker_status (working)
Sep 13 15:17:12 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2417 from worker_status (free)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2896 from worker_status (free)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2382 from worker_status (free)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2898 from worker_status (working)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 3230 from worker_status (working)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2507 from worker_status (working)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 914 from worker_status (working)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 3113 from worker_status (working)
Sep 13 15:17:13 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2488 from worker_status (free)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 3234 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2205 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2214 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 898 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2104 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 3026 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 2853 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 885 from worker_status (working)
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [info] [pid:30094] Received worker 887 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Sep 13 15:17:14 openqa openqa-websockets-daemon[30094]: [debug] [pid:30094] Updating seen of worker 887 from worker_status (working)

So worker 887 sent status updates for "working" only 3 seconds apart. Something about that is wrong, and I'm not sure how to interpret the data. If two messages had been buffered we would see them immediately one after another, but here we have other status updates in between.

Update: This happens because during engine startup, after a new job has been assigned, the worker sends an extra status update that is exempt from the calculated interval.

Actions #15

Updated by kraih 3 months ago

kraih wrote in #note-14:

Update: This happens because during engine startup, after a new job has been assigned, the worker sends an extra status update that is exempt from the calculated interval.

And patch 8 to exclude that case: https://github.com/os-autoinst/openQA/pull/5307
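
The kind of check behind the "status too close to the last update" message can be sketched as follows. This is an illustrative Python sketch only, not the code from the linked pull requests; the class name, the `exempt` flag and the 10 second threshold are assumptions made for this example.

```python
class StatusIntervalMonitor:
    """Flag status updates that arrive too soon after the previous one (hypothetical)."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_seen = {}  # worker_id -> time of the last update in seconds

    def record(self, worker_id, now_s, exempt=False):
        """Store the update time and return True if the update came too early."""
        last = self._last_seen.get(worker_id)
        self._last_seen[worker_id] = now_s
        if exempt or last is None:
            # First update, or an update known to be sent outside the
            # regular interval (such as the extra one after job assignment).
            return False
        return (now_s - last) < self.min_interval_s


# Threshold of 10 seconds is an assumed value, not taken from openQA.
monitor = StatusIntervalMonitor(min_interval_s=10)
first = monitor.record(887, 11.0)                # 15:17:11 in the log above
second = monitor.record(887, 14.0)               # 15:17:14, only 3 seconds later
third = monitor.record(887, 17.0, exempt=True)   # same gap, but marked exempt
print(first, second, third)  # False True False
```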

Actions #16

Updated by kraih 3 months ago

  • Status changed from In Progress to Feedback

Time to wait for new data.

Actions #17

Updated by okurz 3 months ago

Let's wait a few days to have enough data, see due date.

Actions #18

Updated by kraih 3 months ago

In the past 24 hours we've only had one occurrence of the log message, and that one actually seems relevant, though very close to the limit (checked with sudo journalctl --since="24 hours ago" -u openqa-websockets -g 'overloaded'):

Sep 15 07:58:03 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2847 websocket connection closed - 1006
Sep 15 07:58:07 openqa openqa-websockets-daemon[1555]: [debug] [pid:1555] Updating seen of worker 2847 from worker_status (free)
Sep 15 07:59:06 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Received worker 2847 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Sep 15 07:59:06 openqa openqa-websockets-daemon[1555]: [debug] [pid:1555] Updating seen of worker 2847 from worker_status (free)

It seems the processing of the first status update was slightly delayed because the websocket server was busy. Looking at other log messages around the same time, it appears a lot of connections were getting reset:

Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2838 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2853 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2832 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2925 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2887 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2911 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2839 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2843 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2833 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2842 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2919 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2834 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2893 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2909 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2888 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2885 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2809 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2906 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2870 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2813 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2519 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2493 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 3234 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2515 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2504 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2492 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2495 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2518 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2526 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2494 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2502 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2498 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 3232 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2497 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2521 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2520 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2508 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2506 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2500 websocket connection closed - 1006
Sep 15 07:58:05 openqa openqa-websockets-daemon[1555]: [info] [pid:1555] Worker 2507 websocket connection closed - 1006

So the log message has worked and shown us a case in which the websocket server was overloaded for a very short amount of time by lots of reconnects.

Actions #19

Updated by kraih 2 months ago

While I am seeing a few log messages for the past two weeks, it seems they were all caused by worker service restarts:

Oct 01 07:14:46 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1962 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:46 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1939 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:46 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1953 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:46 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1961 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:47 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1954 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:47 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1959 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:14:47 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 1951 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:15:35 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 2454 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:15:38 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 2488 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:15:41 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 2840 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 01 07:15:47 openqa openqa-websockets-daemon[7052]: [info] [pid:7052] Received worker 2269 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:14 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2434 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:15 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2417 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:16 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2442 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:17 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2471 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:20 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2497 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:20 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2349 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:21 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2869 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 02 07:15:21 openqa openqa-websockets-daemon[14839]: [info] [pid:14839] Received worker 2851 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 04 10:02:49 openqa openqa-websockets-daemon[23029]: [info] [pid:23029] Received worker 3091 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 03:10:58 openqa openqa-websockets-daemon[23029]: [info] [pid:23029] Received worker 3288 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:14:19 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 3278 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:14:19 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 3272 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:11 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 2908 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:12 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 3053 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:13 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 3109 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:13 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 2848 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:13 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 3315 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:20 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 2268 status too close to the last update, websocket server possibly overloaded or worker misconfigured
Oct 05 07:15:20 openqa openqa-websockets-daemon[15788]: [info] [pid:15788] Received worker 2287 status too close to the last update, websocket server possibly overloaded or worker misconfigured
● openqa-worker-auto-restart@1.service - openQA Worker #1
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; ve>
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Thu 2023-10-05 07:14:15 CEST; 6h ago
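
To see how such messages cluster in time, they can be grouped by day and hour. The Python sketch below uses a handful of timestamps copied from the excerpt above; `count_by_day_and_hour` is a hypothetical helper, and since it only looks at the first three whitespace-separated fields it would also accept complete journal lines in this format.

```python
from collections import Counter

# A few timestamps of the "too close" message, copied from the excerpt above.
stamps = [
    "Oct 01 07:14:46",
    "Oct 01 07:15:35",
    "Oct 02 07:15:14",
    "Oct 04 10:02:49",
    "Oct 05 03:10:58",
    "Oct 05 07:14:19",
]


def count_by_day_and_hour(lines):
    """Count lines per (day, hour) based on the leading 'Mon DD HH:MM:SS' fields."""
    counts = Counter()
    for line in lines:
        month, day, clock = line.split()[:3]
        counts[(f"{month} {day}", clock[:2])] += 1
    return counts


counts = count_by_day_and_hour(stamps)
for (day, hour), n in sorted(counts.items()):
    print(f"{day} {hour}:xx  {n}")
```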

It appears the websocket server has been running pretty smoothly, so I think we can consider this problem resolved for now. It remains to be seen where the new upper limit for the number of websocket connections is. Once we've reached that, we should probably look into removing in-memory state from the websocket service, so it can be scaled up with preforking.

Actions #20

Updated by kraih 2 months ago

  • Status changed from Feedback to Resolved
Actions #21

Updated by okurz 2 months ago

  • Due date deleted (2023-10-06)