action #157666

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

OSD unresponsive and then not starting any more jobs on 2024-03-21

Added by okurz about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

No alerts are firing yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people couldn't access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359

Suggestions

  • Possibly too many online worker instances again, see https://openqa.suse.de/admin/workers, as in #135122
  • Disable/disconnect/power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by completely powering off the machines, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so apparently this was a helpful mitigation.
  • Monitor over the next days/weeks whether we still hit unresponsiveness and the scheduler refusing to assign jobs. If we still see such issues then we must investigate further what else could be the problem. Otherwise we would need to block on #157669 before enabling more worker instances again.
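For the monitoring suggestion, a minimal sketch of how the number of online worker instances per host could be tallied. The shape of the worker records (dicts with `host` and `status` keys, `"dead"` meaning offline) and the sample data are assumptions for illustration, not taken from this ticket. Only the figures 1006 and 878 come from the observation above.

```python
from collections import Counter


def online_instances_per_host(workers):
    """Count worker instances per host, skipping those marked as dead.

    `workers` is assumed to be a list of dicts with "host" and "status"
    keys; this record shape is a hypothetical stand-in.
    """
    counts = Counter()
    for worker in workers:
        if worker.get("status") != "dead":
            counts[worker["host"]] += 1
    return counts


# Hypothetical sample records, only to exercise the function.
sample = [
    {"host": "worker36", "status": "idle"},
    {"host": "worker36", "status": "running"},
    {"host": "worker37", "status": "dead"},
    {"host": "worker37", "status": "idle"},
]
print(dict(online_instances_per_host(sample)))

# Figures from this ticket: online instances before and after the power-off.
before, after = 1006, 878
print(f"instances taken offline: {before - after}")
```

With the ticket's figures, powering off the four machines took 128 instances offline.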

Out of scope

  • Any significant change in implementation

Rollback steps

  • Bring worker3[6-9] back online using IPMI commands
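A rough sketch of what the rollback could look like, assuming `ipmitool` over the `lanplus` interface. The BMC hostname pattern (`-bmc.example.invalid`) and the user name `ADMIN` are hypothetical placeholders, as the ticket does not state them. The script only prints the commands and does not execute anything.

```python
import re


def expand_host_range(pattern):
    """Expand a single-digit bracket range, e.g. "worker3[6-9]"."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{digit}{suffix}" for digit in range(int(first), int(last) + 1)]


def power_on_command(bmc_host, user):
    """Build an ipmitool power-on command as an argument list.

    "-E" makes ipmitool read the password from the IPMI_PASSWORD
    environment variable instead of the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-E", "chassis", "power", "on"]


for name in expand_host_range("worker3[6-9]"):
    # Hypothetical BMC address; replace with the real management hostname.
    bmc_host = f"{name}-bmc.example.invalid"
    print(" ".join(power_on_command(bmc_host, "ADMIN")))
```

Printing instead of executing leaves room to review the commands before running them against the real management interfaces.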

Related issues 3 (2 open, 1 closed)

  • Related to openQA Project - coordination #157669: websockets+scheduler improvements (New, 2023-08-31)
  • Related to openQA Infrastructure - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) (Blocked, okurz, 2024-03-18)
  • Copied from openQA Infrastructure - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z (Resolved, okurz, 2024-03-12)
