action #80678

Make sure that a certain number of workers of a specific class are running

Added by asmorodskyi 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2020-12-03
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Goal: Introduce a workflow where we can be sure that a defined number of instances are running.

Use case:
Currently we have a limited number of workers with WORKER_CLASS=pc_azure (the same applies to pc_ec2, pc_gce, pc_azure_qam, etc.). Sometimes one of the workers running these instances goes down due to technical issues. This introduces several problems:

  1. There is no notification at that level. There was an announcement that some openQA worker is down, but no warning about the consequences of this event.
  2. If the worker is down for a long period, the process to solve it is quite complicated (you need to create a PR for a certain repo, pass review, and get it merged and deployed). All this time, jobs which can be processed only by this worker class will hang in the queue.

Ideas how to solve it:

  1. Introduce an API call allowing WORKER_CLASS to be added/removed at runtime
  2. Introduce a service which would ping dedicated worker instances and act accordingly when they go down or come back up
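
As a rough illustration of the second idea, the check such a service would perform could look like the sketch below. It is not an existing openQA feature: the `REQUIRED` mapping, the `find_shortfalls` helper and the shape of the worker records (`classes`, `online`) are all assumptions made up for this example, and how the worker list is obtained is left open.

```python
from collections import Counter

# Hypothetical minimum number of online workers required per class.
REQUIRED = {"pc_azure": 2, "pc_ec2": 2}


def find_shortfalls(workers, required):
    """Return {worker_class: missing_count} for classes below their minimum.

    `workers` is a list of dicts with the assumed keys "classes"
    (list of worker class names) and "online" (bool).
    """
    online = Counter()
    for worker in workers:
        if worker["online"]:
            # Count each class once per online worker.
            online.update(set(worker["classes"]))
    return {
        cls: minimum - online[cls]
        for cls, minimum in required.items()
        if online[cls] < minimum
    }


# Made-up sample data: one of the pc_azure/pc_ec2 workers is down.
workers = [
    {"name": "worker1:1", "classes": ["qemu_x86_64", "pc_azure"], "online": True},
    {"name": "worker2:1", "classes": ["pc_azure", "pc_ec2"], "online": False},
    {"name": "worker3:1", "classes": ["pc_ec2"], "online": True},
]
shortfalls = find_shortfalls(workers, REQUIRED)
print(shortfalls)
```

A non-empty result is the point where the service would raise the missing notification or try to bring up a replacement.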

History

#1 Updated by coolo 4 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/285#note_283832 should tell you that the worker class is not the only thing required. And I don't think we want to reimplement kubernetes here. So I'd vote for WONTFIX

#2 Updated by asmorodskyi 4 months ago

coolo wrote:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/285#note_283832 should tell you that the worker class is not the only thing required. And I don't think we want to reimplement kubernetes here. So I'd vote for WONTFIX

Well, this part is easy to address: we should just propagate these variables to all workers. Also, "Ideas how to solve it" is just a shot in the dark. The main purpose of this ticket is actually to open a discussion on how we can achieve the goal. For example, Nick just suggested the following in testing: "why not have this worker class on (almost) all workers? But make sure only X can run simultaneously." Basically, this moves the problem from the WORKER_CLASS definition to the scheduler.

#3 Updated by coolo 4 months ago

Let's take the workers on grenache. They all have the "I can control a s390 VM" worker class. That doesn't make them interchangeable, because which VM they control is configured in their settings.

A "per worker class limit" sounds like a plausible idea, but I sense the "my job is scheduled, why doesn't workerX pick it? It has the right worker class" question will come up daily.

#4 Updated by asmorodskyi 4 months ago

coolo wrote:

Let's take the workers on grenache. They all have the "I can control a s390 VM" worker class. That doesn't make them interchangeable, because which VM they control is configured in their settings.

A "per worker class limit" sounds like a plausible idea, but I sense the "my job is scheduled, why doesn't workerX pick it? It has the right worker class" question will come up daily.

Since we are discussing potentially moving this problem to the scheduler, we can drop the use of WORKER_CLASS here entirely and come up with variables like SCHEDULER_CGROUP=azure and SCHEDULER_CGROUP_LIMIT=2, dropping the pc_azure/pc_ec2 classes altogether.
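
To make the proposal concrete, here is a minimal sketch of how a scheduler could honor such a limit. It only models the limit check itself: `assign_jobs` is a hypothetical helper, jobs are reduced to `(job_id, group)` tuples where `group` stands for the proposed SCHEDULER_CGROUP value, and the availability of free workers is deliberately ignored.

```python
from collections import Counter


def assign_jobs(scheduled, running, limits):
    """Pick scheduled jobs to start without exceeding any group's limit.

    `scheduled` and `running` are lists of (job_id, group) tuples, where
    `group` is the job's assumed SCHEDULER_CGROUP value (None if unset).
    `limits` maps a group name to its assumed SCHEDULER_CGROUP_LIMIT.
    """
    active = Counter(group for _, group in running if group is not None)
    to_start = []
    for job_id, group in scheduled:
        limit = limits.get(group)
        if limit is not None and active[group] >= limit:
            continue  # group is at capacity; the job stays in the queue
        if group is not None:
            active[group] += 1
        to_start.append(job_id)
    return to_start


# Made-up sample: one azure job is already running, the limit is 2.
running = [(100, "azure")]
scheduled = [(101, "azure"), (102, "azure"), (103, None), (104, "ec2")]
started = assign_jobs(scheduled, running, {"azure": 2, "ec2": 1})
print(started)
```

In this sample, job 102 stays queued although a matching worker might be free, which is exactly the situation coolo expects users to ask about.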

#5 Updated by okurz 4 months ago

  • Category set to Feature requests
  • Target version set to future

coolo wrote:

I don't think we want to reimplement kubernetes here. So I'd vote for WONTFIX

Reimplementing kubernetes here is definitely not planned. But I actually thought of kubernetes as a way to solve it, because kubernetes already controls something like "ensure there are at least N nodes of one type" or "max M nodes of one type". Non-qemu workers do not have such high performance requirements, so scaling there could be done more easily with something like kubernetes instead of manually configuring lists on dedicated machines that were really intended to run virtualization themselves, not just to pass through keys and ssh and encode a video stream.

Thinking of simple examples for the mentioned "per worker class limit", one could specify it with WORKER_CLASS=pc_azure:2, but coolo is right that the current list of workers with their worker classes, which can give an indication of free capacity, would not easily provide feedback on the artificial limit. When thinking about implementing said limit, I would at the same time consider how users can monitor and control it.
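
A minimal sketch of what parsing such a setting and reporting the remaining capacity could look like follows. The `name:limit` suffix is only the syntax floated above, not something openQA supports today, and `parse_worker_class` and `capacity_report` are hypothetical helpers.

```python
def parse_worker_class(value):
    """Parse e.g. "qemu_x86_64,pc_azure:2" into {class_name: limit_or_None}.

    A class without a ":N" suffix gets None, meaning "no limit".
    A non-numeric limit raises ValueError.
    """
    limits = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, limit = entry.partition(":")
        limits[name] = int(limit) if sep else None
    return limits


def capacity_report(limits, running_per_class):
    """Return {class_name: remaining_slots} for every class that has a limit."""
    return {
        name: max(limit - running_per_class.get(name, 0), 0)
        for name, limit in limits.items()
        if limit is not None
    }


limits = parse_worker_class("qemu_x86_64,pc_azure:2")
# Made-up sample: one pc_azure job is currently running.
report = capacity_report(limits, {"pc_azure": 1})
print(limits)
print(report)
```

Exposing something like this report next to the worker list would be one way to give users the feedback about the artificial limit.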
