coordination #155485

open

[saga][epic] Efficient openQA worker pool resource handling in datacenters

Added by okurz 5 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2022-07-25
Due date:
% Done: 32%
Estimated time: (Total: 0.00 h)

Description

Motivation

This is based on a bootstrapping discussion between lvogdt, mgriessmeier and okurz and on my general goal of making efficient use of computing resources.
Our openQA instances, especially openqa.suse.de but also openqa.opensuse.org, hold resources ready in the form of physical machines running openQA worker instances for different architectures and worker classes. Enough resources are in place so that builds for products can finish testing in reasonable time, e.g. some hours for a new Tumbleweed snapshot. But between builds there are often idle resources, e.g. idle x86_64 openQA workers. Resources in on-premise datacenters as well as in the public cloud can be reassigned for other purposes. So we should teach our applications to request resources only as needed, to run necessary workloads from scratch when machines are dynamically switched and used for other purposes, and to give resources back to a resource pool when the workload schedule allows.


Subtasks 13 (10 open, 3 closed)

coordination #155488: [epic] openQA can dynamically power off/on available machines based on scheduler needs | New | - | 2024-02-14
openQA Infrastructure - coordination #158374: [epic] Prevention of inefficient hardware resource use | New | - | 2022-07-25
openQA Infrastructure - action #114622: Identification of unused/idle machines by alarming when there is no traffic on the corresponding switch ports for some time | New | - | 2022-07-25
openQA Infrastructure - action #133700: Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue, for all current top-of-rack switches (TORs) that we are connected to size:M | Blocked | okurz | 2023-08-02
openQA Infrastructure - action #158377: Detect from monitoring data which monitored machines show a too low system usage over time size:M | Workable | - | 2024-04-01
openQA Infrastructure - action #158380: Detect and switch off from monitoring data which monitored machines show a too low CPU usage over time | New | - | 2024-04-01
openQA Infrastructure - action #158383: Crosscheck which machines marked as "unused" in racktables are still pingable (as they should not be powered on at all) size:M | Resolved | nicksinger | 2024-04-01
openQA Infrastructure - action #158386: Crosscheck which machines marked as "unused" in racktables still draw power according to ePDU data (as they should not be powered on and wasting significant power at all) - NUE2 size:M | Workable | - | 2024-04-01
openQA Infrastructure - action #158907: Automated check for machines marked as "unused" in racktables but still pingable (as they should not be powered on at all) size:M | Resolved | nicksinger | -
coordination #159564: [epic] Efficient use of currently available hardware ressources | New | - | 2024-04-24
openQA Infrastructure - action #159534: Make use of all 12 redcurrant LPARs | New | - | 2024-04-24
coordination #162041: [epic] Improved worker behaviour | New | - | -
action #161327: Prevent remote workers registering as "localhost" size:S | Resolved | okurz | -

#1

Updated by okurz 5 months ago

  • Tracker changed from action to coordination

#2

Updated by okurz 5 months ago

  • Subtask #155488 added

#3

Updated by okurz 5 months ago

  • Description updated

#5

Updated by okurz 4 months ago

  • Subtask #158374 added

#7

Updated by okurz 3 months ago

  • Subtask #159564 added

#8

Updated by okurz about 1 month ago

  • Subtask #162041 added