action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M - openQA Project (public) - openSUSE Project Management Tool

action #158146

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M

Added by okurz about 1 year ago. Updated 11 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2024-03-27

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of misconfigured MTU/GRE tunnels. As a workaround, complete multi-machine tests can be forced to run on the same worker host. In #135035 we added a feature flag that limits jobs to a single physical host. It can be used for debugging, as a temporary workaround, or where the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts that might not be properly interconnected, nothing prevents workers from picking up such clusters. This causes hard-to-investigate openQA job failures, which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see that flag before assigning jobs to workers?

Acceptance Criteria

AC1: The openQA scheduler does not schedule across-host multi-machine clusters to any host that has the feature flag from #135035, or a similar flag, set (considering proposals in #157144-2)
AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts

Suggestions

Look into what was done in #135035 but for the central openQA scheduler
Investigate whether any worker properties are already readable by the openQA scheduler when scheduling. At least it knows about the worker class already, right? Should we translate the feature flag from #135035 into a "special worker class" that acts as an exclusive class implemented by only one host at a time?
Consider the proposals in #157144-2 regarding using either a special worker class or the flag from #135035 directly (PARALLEL_ONE_HOST_ONLY=1)
Ensure that the scheduler does not schedule across-host multi-machine clusters to any host that has such a special worker class or worker property
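
To make the intended scheduler behavior concrete, here is a minimal sketch in Python. It is not the openQA implementation (openQA is written in Perl) and does not reflect how the scheduler is actually structured. The `Worker` type, the `properties` dict, and the functions `host_is_exclusive`, `first_fit` and `assign_cluster` are all hypothetical names chosen for illustration; only the flag name PARALLEL_ONE_HOST_ONLY comes from #135035. The sketch assumes each job needs exactly one worker class and uses a simple greedy first-fit assignment.

```python
from dataclasses import dataclass, field

# Flag name taken from #135035; everything else here is hypothetical.
ONE_HOST_FLAG = "PARALLEL_ONE_HOST_ONLY"


@dataclass
class Worker:
    id: int
    host: str
    classes: set
    # Assumption: the worker reports its settings to the scheduler as properties.
    properties: dict = field(default_factory=dict)


def host_is_exclusive(worker):
    """True if the worker's host opted out of across-host clusters."""
    return str(worker.properties.get(ONE_HOST_FLAG, "0")) == "1"


def first_fit(jobs, workers):
    """Greedily give each job a distinct worker offering its class.

    `jobs` maps a job id to the single worker class it requires.
    Returns a dict of job id to worker id, or None if a job stays unassigned.
    """
    free = list(workers)
    assignment = {}
    for job_id, wanted_class in jobs.items():
        match = next((w for w in free if wanted_class in w.classes), None)
        if match is None:
            return None
        free.remove(match)
        assignment[job_id] = match.id
    return assignment


def assign_cluster(jobs, workers):
    """Assign a parallel cluster while honoring the one-host-only flag."""
    # AC2: by default the cluster may span several hosts, but only hosts
    # that did not mark themselves as exclusive take part in that attempt.
    open_workers = [w for w in workers if not host_is_exclusive(w)]
    assignment = first_fit(jobs, open_workers)
    if assignment is not None:
        return assignment
    # AC1: a flagged host is only used if it can take the whole cluster alone.
    for host in sorted({w.host for w in workers}):
        same_host = [w for w in workers if w.host == host]
        assignment = first_fit(jobs, same_host)
        if assignment is not None:
            return assignment
    return None  # leave the cluster scheduled until suitable workers are free
```

The same filtering would apply if the marker were a special worker class instead of a property: only `host_is_exclusive` would change. Whether the cross-host attempt or the single-host attempt should be tried first is a design choice this sketch does not settle.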

Related issues 2 (1 open — 1 closed)

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by livdywan about 1 year ago

Updated by ybonatakis 12 months ago

Updated by openqa_review 12 months ago

Updated by ybonatakis 12 months ago

Updated by mkittler 12 months ago · Edited

Updated by ybonatakis 12 months ago

Updated by ybonatakis 12 months ago

Updated by okurz 12 months ago

Updated by livdywan 11 months ago

Updated by ybonatakis 11 months ago

Updated by ybonatakis 11 months ago

Updated by mkittler 11 months ago · Edited

Updated by ybonatakis 11 months ago

Updated by mkittler 11 months ago

Updated by okurz 11 months ago

Updated by mkittler 11 months ago

Updated by mkittler 11 months ago

Updated by mkittler 11 months ago

Updated by mkittler 11 months ago · Edited

Updated by mkittler 11 months ago · Edited

Updated by mkittler 11 months ago

Updated by mkittler 11 months ago