action #157726

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #108209: [epic] Reduce load on OSD

osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)

Added by livdywan 9 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415705

worker37.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
worker36.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
worker38.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
worker39.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
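
The output above could be checked automatically. The following is a minimal sketch, assuming the deployment log keeps the two-line layout shown here (a minion ID ending in a colon, followed by an indented result line); the function name `unresponsive_minions` is hypothetical and not part of osd-deployment or Salt.

```python
def unresponsive_minions(salt_output: str) -> list[str]:
    """Return the sorted minion IDs that report 'Minion did not return'.

    Assumes the layout from the observation above: an unindented
    '<minion-id>:' line followed by an indented result line.
    """
    minions = []
    current = None
    for line in salt_output.splitlines():
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            # Unindented line ending in ':' starts a new minion section.
            current = line.rstrip()[:-1]
        elif current and "Minion did not return" in line:
            minions.append(current)
            current = None
    return sorted(minions)


if __name__ == "__main__":
    log = (
        "worker37.oqa.prg2.suse.org:\n"
        "    Minion did not return. [Not connected]\n"
        "worker36.oqa.prg2.suse.org:\n"
        "    Minion did not return. [Not connected]\n"
    )
    for minion in unresponsive_minions(log):
        print(minion)
```

A list like this makes it easier to see at a glance which hosts need attention before retrying the pipeline.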

Acceptance criteria

  • AC1: osd-deployment passes again
  • AC2: All w37-w39 run OSD production jobs

Suggestions

Rollback steps


Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure (public) - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 (Resolved, okurz, 2024-03-12)

Related to openQA Project (public) - coordination #157669: websockets+scheduler improvements to support more online worker instances (New, 2023-08-31)

Related to openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, okurz)

Related to openQA Infrastructure (public) - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Resolved, okurz, 2023-11-04)

Actions #1

Updated by okurz 9 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by okurz 9 months ago

  • Related to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added
Actions #3

Updated by okurz 9 months ago

  • Parent task set to #108209
Actions #4

Updated by okurz 9 months ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal
  • Target version changed from Ready to future
Actions #5

Updated by okurz 9 months ago

  • Related to coordination #157669: websockets+scheduler improvements to support more online worker instances added
Actions #6

Updated by okurz 9 months ago

Blocked by #157669

Actions #7

Updated by okurz 5 months ago

As discussed today between nicksinger and me, for the time being we can keep worker3[6-9] offline, especially as there is less load over the summer and we can save electrical energy. When more workers are needed, they can be brought back online at any time.

Actions #8

Updated by okurz 3 months ago

  • Related to action #166802: Recover worker37, worker38, worker39 size:S added
Actions #9

Updated by okurz 3 months ago

  • Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Actions #10

Updated by okurz 3 months ago

  • Target version changed from future to Ready

Worked on as part of #139103 and #166802

Actions #11

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #12

Updated by okurz 2 months ago

  • Status changed from Blocked to Resolved

OSD deployment is fine. w37+w38 are still offline due to performance constraints and are handled in #166802.
