action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #139103

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue

Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: Feature requests
Target version: openQA Project (public) - Ready
Start date: 2023-11-04
Due date:
% Done:
Estimated time:
Tags: osd, ppc64le, infra, nue3, nue2

Description

Motivation¶

Currently OSD has a long job queue, in particular for ppc64le, for multiple reasons; see #139010. One idea is to decrease the number of x86_64 worker slots on OSD so that, given the OSD openQA instance job limit, ppc64le jobs have a better chance of being assigned to workers.

Acceptance criteria¶

AC1: The impact of worker instance ratio by arch/class has been verified
AC2: Given the openQA instance job limit is impacting the ppc64le job queue When the ratio of ppc64le/all workers has been increased Then the ppc64le job age is lower

Suggestions¶

DONE Look up the current number of x86_64 and qemu ppc64le jobs, assuming that we have a very low ppc64le/all ratio, e.g. many workers for qemu_x86_64 and very few for qemu_ppc64le (16 as of 2023-11-04).
DONE Reduce the number of x86_64 qemu slots if we have "too many"
Monitor the impact on qemu_ppc64le job age
Increase the number of ppc64le machines, then re-enable the x86_64 machines
Take care to apply the workarounds from #157975-12 to prevent accidental distribution upgrades
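
The ratio the first suggestions refer to can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the `worker_class` field name, the helper names and the figure of 48 x86_64 slots are hypothetical; only the 16 qemu_ppc64le slots come from this ticket. The share is computed among the counted classes only, not among all OSD workers.

```python
from collections import Counter


def slots_per_class(workers, classes):
    """Count worker slots that advertise each worker class of interest."""
    counts = Counter()
    for worker in workers:
        # A slot may advertise several comma-separated classes.
        advertised = {c.strip() for c in worker.get("worker_class", "").split(",")}
        for cls in classes:
            if cls in advertised:
                counts[cls] += 1
    return counts


def class_share(counts, cls):
    """Share of `cls` among the counted classes (0.0 if nothing was counted)."""
    total = sum(counts.values())
    return counts[cls] / total if total else 0.0


CLASSES = ["qemu_x86_64", "qemu_ppc64le"]

# Hypothetical slot list: 48 x86_64 slots is a made-up number,
# 16 qemu_ppc64le slots is the figure quoted above (as of 2023-11-04).
workers = (
    [{"worker_class": "qemu_x86_64,tap"}] * 48
    + [{"worker_class": "qemu_ppc64le"}] * 16
)
counts = slots_per_class(workers, CLASSES)
ratio_before = class_share(counts, "qemu_ppc64le")

# Same calculation after (hypothetically) disabling 32 of the x86_64 slots.
reduced = workers[32:]
ratio_after = class_share(slots_per_class(reduced, CLASSES), "qemu_ppc64le")

print(f"ppc64le share before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

With a global job limit, a higher share of ppc64le slots is what this ticket expects to translate into a lower qemu_ppc64le job age; the sketch only shows the arithmetic, not the scheduler behaviour.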

Rollback steps¶

Re-enable openQA OSD workers w35-w36 and remove the corresponding alert silence https://monitor.qa.suse.de/alerting/silence/e2c36842-e6a9-4d48-aeef-330c3d8604c7/edit?alertmanager=grafana
Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/687 to enable multi-machine tests after ensuring stability

Out of scope¶

Any code changes for the scheduler

Related issues 5 (3 open — 2 closed)

Related to openQA Tests (public) - action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems

New

2023-11-24

Related to openQA Project (public) - action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S

Feedback

dheidler

2024-06-14

2025-01-23

Related to openQA Infrastructure (public) - action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)

Resolved

okurz

2024-03-18

Copied from openQA Infrastructure (public) - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2

Resolved

okurz

2023-11-04

Copied to openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S

Blocked
