Project

General

Profile

Actions

action #158146

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M

Added by okurz 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-03-27
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 20230814, because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found in forcing the complete multi-machine tests to run in the same worker. In #135035 we added a feature flag to limit jobs to a single physical host which can be used for debugging or as temporary workaround or if the network design prevents multiple hosts to be interconnected by GRE tunnels. But by default when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts which might not be properly interconnected then there is no measure preventing workers to pick up such clusters causing hard to investigate openQA job failures which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see that flag before assigning to workers?

Acceptance Criteria

  • AC1: the openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from #135035 set or like that feature flag (considering proposals in #157144-2)
  • AC2: By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts

Suggestions

  • Look into what was done in #135035 but for the central openQA scheduler
  • Investigate if any worker properties are already available to read by the openQA scheduler when scheduling. At least it knows about the worker class already, right? Should we translate the feature flag from #135035 as a "special worker class" to act as an exclusive class that is only implemented by one host at a time?
  • Consider proposals in #157144-2 regarding using a special worker class or directly the flag from #135035 PARALLEL_ONE_HOST_ONLY=1
  • Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such special worker class or worker property

Related issues 2 (1 open1 closed)

Related to openQA Project (public) - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configurationResolvedokurz2022-06-03

Actions
Copied from openQA Project (public) - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not availableNew

Actions
Actions

Also available in: Atom PDF