action #158146

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves

Added by okurz about 1 month ago. Updated about 21 hours ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-03-27
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, complete multi-machine tests were forced to run on the same worker. In #135035 we added a feature flag that limits jobs to a single physical host. It can be used for debugging, as a temporary workaround, or where the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes that are fulfilled by multiple hosts which might not be properly interconnected, nothing prevents workers from picking up such clusters. This causes hard-to-investigate openQA job failures, which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see the flag before assigning jobs to workers?

Acceptance Criteria

  • AC1: The openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from #135035 set
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts

Suggestions

  • Look into what was done in #135035 and apply the same idea to the central openQA scheduler
  • Investigate whether any worker properties are already available to the openQA scheduler at scheduling time. At least it already knows about the worker class, right? Should we translate the feature flag from #135035 into a "special worker class" that acts as an exclusive class implemented by only one host at a time?
  • Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such a special worker class or worker property
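
To make the suggestions above more concrete, here is a minimal sketch of the intended scheduling rule. It is written in Python for illustration only and is not openQA code: the `Worker` type, the `one_host_only` property and the `assign_cluster` function are all hypothetical names, and the greedy first-fit matching is a simplification that can miss valid assignments a real scheduler would find.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Worker:
    id: int
    host: str
    worker_classes: frozenset
    # Hypothetical worker property mirroring the feature flag from #135035.
    one_host_only: bool = False


def _first_fit(jobs: Dict[int, str], workers: List[Worker]) -> Optional[Dict[int, Worker]]:
    """Greedily give each job the first unused worker that offers its class."""
    used = set()
    assignment = {}
    for job_id, wanted_class in jobs.items():
        match = next(
            (w for w in workers if w.id not in used and wanted_class in w.worker_classes),
            None,
        )
        if match is None:
            return None  # at least one job of the cluster cannot be placed
        used.add(match.id)
        assignment[job_id] = match
    return assignment


def assign_cluster(jobs: Dict[int, str], workers: List[Worker]) -> Optional[Dict[int, Worker]]:
    """Assign a parallel cluster (job id -> required worker class) to free workers."""
    # Assumption: a host counts as flagged if any of its workers carries the property.
    flagged_hosts = {w.host for w in workers if w.one_host_only}

    # Default case (AC2): spread over any hosts that did not exclude themselves.
    assignment = _first_fit(jobs, [w for w in workers if w.host not in flagged_hosts])
    if assignment is not None:
        return assignment

    # Flagged hosts (AC1): only usable if the whole cluster fits on that one host.
    for host in sorted(flagged_hosts):
        assignment = _first_fit(jobs, [w for w in workers if w.host == host])
        if assignment is not None:
            return assignment
    return None  # leave the cluster scheduled until suitable workers are free
```

With two "tap" jobs and one free worker each on an unflagged host `a` and an unflagged host `b`, the cluster is spread over both hosts. If host `a` sets the flag and offers only one worker, the cluster stays unassigned instead of being split between `a` and `b`.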

Related issues 1 (1 open, 0 closed)

Copied from openQA Project - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

#1

Updated by okurz about 1 month ago

  • Copied from action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added

#2

Updated by okurz about 1 month ago

  • Subject changed from "Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available" to "Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves"

#3

Updated by okurz about 21 hours ago

  • Target version changed from future to Tools - Next