action #135407

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

[tools] Measure to mitigate websockets overload by workers and revert it size:M

Added by osukup 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2023-09-08
Due date:
% Done: 0%
Estimated time:
Description

Motivation

Consolidate all steps we took to mitigate #135122 and document how to revert each of them.

1) Stopped workers and removed their salt keys:

used:
sudo salt 'worker3[1,2,3,4,5,6]*' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs)'\
&& for i in {1..6}; do sudo salt-key -y -d "worker3$i*"; done

revert:
for i in {1..6}; do sudo salt-key -y -a "worker3$i*";done && sudo salt 'worker3[1,2,3,4,5,6]*' state.apply
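The revert above re-accepts all six keys and applies the state to every worker in one go. In line with the suggestion not to bring them all back at once, a staged variant could look like the following minimal sketch. The `run` helper, the `staged_revert` function and the `DRY_RUN` and `PAUSE_SECONDS` variables are hypothetical names introduced here for illustration; only the `salt-key` and `salt` invocations are taken from the commands in this ticket.

```shell
#!/bin/bash
# Staged variant of the revert: re-accept and re-apply one worker at a time.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical pause between workers, in seconds, to watch for new load issues.
PAUSE_SECONDS="${PAUSE_SECONDS:-0}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Accept the salt key of each given worker3N host, then apply its salt state.
staged_revert() {
    local i
    for i in "$@"; do
        run sudo salt-key -y -a "worker3$i*"
        run sudo salt "worker3$i*" state.apply
        sleep "$PAUSE_SECONDS"
    done
}

staged_revert 1 2 3 4 5 6
```

With the defaults this only lists the commands that would be executed. Setting `DRY_RUN=0` and a non-zero `PAUSE_SECONDS` would run them for real, leaving time between workers to check the scheduler and to remove a worker again if problems reappear.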

2) Lowered the number of workers

used:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/606

revert:
Revert the MR mentioned above in GitLab.

Acceptance criteria

  • AC1: Ensure step #1 has been reverted
  • AC2: DONE Ensure step #2 has been reverted

Suggestions

  • Maybe don't bring them all back at once (and be prepared to remove them again in case of new performance issues)
  • In case of new performance issues make sure to strace the openqa-scheduler and openqa-websockets processes
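For the strace suggestion, the following is a minimal sketch of how a time-limited trace could be captured. It assumes both processes run as systemd units named `openqa-scheduler` and `openqa-websockets`; the helper names, the `DURATION` and `DRY_RUN` variables and the output path under `/tmp` are hypothetical choices made for this example.

```shell
#!/bin/bash
# Attach strace to the main process of a systemd unit for a limited time.
# DRY_RUN=1 (the default in this sketch) only prints the strace command.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical capture length in seconds.
DURATION="${DURATION:-60}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Trace one PID: follow forks (-f), add timestamps (-tt) and syscall
# durations (-T), and write the output to a file named after the unit.
trace_pid() {
    local pid="$1" name="$2"
    run sudo timeout "$DURATION" strace -f -tt -T -p "$pid" -o "/tmp/$name.strace"
}

# Look up the main PID of a systemd unit and trace it.
trace_unit() {
    local unit="$1" pid
    pid="$(systemctl show -p MainPID --value "$unit")"
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "unit $unit has no running main process" >&2
        return 1
    fi
    trace_pid "$pid" "$unit"
}
```

Calling `trace_unit openqa-scheduler` and `trace_unit openqa-websockets` with `DRY_RUN=0` on the openQA host would then write one trace file per service, which can be attached to the ticket.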

Related issues 2 (1 open, 1 closed)

Related to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)

Copied to openQA Infrastructure - action #137756: Re-enable worker31 for multi-machine tests in production auto_review:"tcpdump.+check.log.+timed out at" (Blocked, okurz)
