action #157666

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

OSD unresponsive and then not starting any more jobs on 2024-03-21

Added by okurz 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

No firing alerts yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people couldn't access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359

Suggestions

  • Maybe there are again too many online worker instances (https://openqa.suse.de/admin/workers), as in #135122
  • Disable, disconnect or power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by completely powering off the machines, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so apparently this was a helpful mitigation.
  • Monitor over the next days/weeks whether we hit unresponsiveness again or the scheduler refuses to assign jobs. If we still see such issues, we must investigate further what else could be the problem. Otherwise we would need to block on #157669 before enabling more worker instances again.
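The numbers above can be turned into a rough capacity estimate. The following is only a sketch: the function names are hypothetical, the target of 900 online instances is an arbitrary example value and not a limit stated in this ticket, and it assumes that all hosts contribute roughly the same number of worker instances.

```python
import math


def instances_per_host(before: int, after: int, hosts_removed: int) -> float:
    """Average number of worker instances contributed by each removed host."""
    if hosts_removed <= 0:
        raise ValueError("hosts_removed must be positive")
    return (before - after) / hosts_removed


def hosts_to_power_off(online: int, target: int, per_host: float) -> int:
    """Estimate how many hosts to take offline to get down to `target` instances.

    Assumes every host runs about `per_host` instances (an approximation).
    """
    if online <= target:
        return 0
    return math.ceil((online - target) / per_host)


if __name__ == "__main__":
    # Figures from this ticket: 1006 -> 878 instances after powering off 4 hosts.
    per_host = instances_per_host(1006, 878, 4)
    print(f"average instances per host: {per_host}")
    # 900 is a made-up example target, not a documented threshold.
    print(f"hosts to power off for a target of 900: {hosts_to_power_off(1006, 900, per_host)}")
```

With the figures from this ticket the four machines account for 128 instances, i.e. 32 per machine on average.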

Out of scope

  • Any significant change in implementation

Rollback steps

  • Bring worker3[6-9] back online using IPMI commands
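The rollback step could be scripted roughly as follows. This is a dry-run sketch that only prints `ipmitool` command lines and does not execute them. The BMC addresses, the user name `admin` and the use of the `lanplus` interface are assumptions: the real management hosts are likely named differently from the worker host names used here, and credentials are not part of this ticket. `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
import shlex


def worker_hosts(first: int = 36, last: int = 39,
                 domain: str = "oqa.prg2.suse.org") -> list[str]:
    """Expand worker3[6-9] into individual host names."""
    return [f"worker{n}.{domain}" for n in range(first, last + 1)]


def ipmi_power_command(bmc_host: str, action: str = "on",
                       user: str = "admin") -> list[str]:
    """Build an ipmitool command line for a chassis power action.

    `bmc_host` and `user` are placeholders (assumptions), not real values.
    """
    if action not in {"on", "off", "status"}:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-E", "chassis", "power", action]


if __name__ == "__main__":
    # Dry run: print the commands instead of running them.
    for host in worker_hosts():
        print(shlex.join(ipmi_power_command(host, "on")))
```

Printing instead of executing keeps the sketch safe to run. To actually power on a machine, the printed line would be run manually or passed to `subprocess.run` after checking the BMC address.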

Related issues 4 (1 open, 3 closed)

Related to openQA Project - coordination #157669: websockets+scheduler improvements to support more online worker instances [New, 2023-08-31]

Related to openQA Infrastructure - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) [Resolved, okurz, 2024-03-18]

Related to openQA Infrastructure - action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server [Resolved, okurz, 2024-09-28]

Copied from openQA Infrastructure - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z [Resolved, okurz, 2024-03-12]

Actions #1

Updated by okurz 7 months ago

  • Copied from action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z added
Actions #2

Updated by okurz 7 months ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent
Actions #3

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 7 months ago

Using strace on the openqa-scheduler process, tinita, mkittler and I saw a lot of "SELECT value FROM worker_properties…" queries, so this might be a point for optimization.
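To quantify such an observation, captured strace output could be summarized per table. The sketch below is an assumption about how one might do this, not what was done here: it expects strace output lines on stdin (for example captured with something like `strace -f -s 256 -p <pid>`, options chosen for illustration) and counts which tables appear in `SELECT … FROM <table>` statements. The function name and the regular expression are hypothetical helpers.

```python
import re
import sys
from collections import Counter

# Matches "SELECT <columns> FROM <table>" and captures the table name.
SELECT_RE = re.compile(
    r"SELECT\s+.+?\s+FROM\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE
)


def count_select_tables(lines):
    """Count SELECT statements per table name found in the given lines."""
    counts = Counter()
    for line in lines:
        for match in SELECT_RE.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    # Example: strace -f -s 256 -p <pid> 2>&1 | python3 count_selects.py
    for table, count in count_select_tables(sys.stdin).most_common(10):
        print(f"{count:8d}  {table}")
```

A table that dominates the output, such as worker_properties in the observation above, would be a candidate for caching or for fetching the values in one query.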

Actions #5

Updated by okurz 7 months ago

We identified multiple improvement points within #157669.

Actions #7

Updated by okurz 7 months ago

  • Related to coordination #157669: websockets+scheduler improvements to support more online worker instances added
Actions #8

Updated by okurz 7 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High
Actions #9

Updated by okurz 7 months ago

  • Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
Actions #10

Updated by okurz 7 months ago

  • Status changed from Feedback to Resolved

Right now 100 jobs are running and 600 are scheduled, so all good. Follow-up tasks have been defined.

Actions #11

Updated by tinita about 1 month ago

  • Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added