action #158143

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available

Added by okurz 8 months ago. Updated 4 months ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, complete multi-machine test clusters were forced to run on the same worker. In #135035 we added a feature flag that limits jobs to a single physical host; it can be used for debugging, as a temporary workaround, or when the network design prevents multiple hosts from being interconnected by GRE tunnels.

By default, however, multi-machine jobs can be scheduled with worker classes that are fulfilled by multiple hosts which might not be properly interconnected. Nothing currently prevents workers from picking up such clusters, which causes openQA job failures that are hard to investigate and that we should try to prevent. We should make workers unassign/reject/incomplete jobs when an across-host multimachine setup is requested but not available, and optionally inform about the possibility of using the "limit to one host only" feature flag.

Acceptance Criteria

  • AC1: openQA workers with "tap" class but not configured for across-host multimachine setup do not fail openQA jobs due to being spread over multiple hosts
  • AC2: By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts

Suggestions

  • Look into what was done in #135035 but for the central openQA scheduler
  • Investigate if a worker knows about other workers that it would need to communicate with in a multi-machine cluster job, possibly during the "assignment" step
  • Implement a pre-run check, possibly during the "assignment" step, in which the worker verifies that the prerequisites for across-host multimachine testing are fulfilled if the test cluster needs them, and fails early otherwise
  • Ensure that such an early failure is fed back to the openQA scheduler, e.g. by unassigning the job, possibly with an explicit message visible to admins somewhere
  • If unassigning is not possible, then somehow "reject" jobs or, as a last resort, "incomplete" a job with an explicit "reason", which is still better than actually starting an openQA job and then causing failures
  • Optionally, suggest in the returned message/reason that admins/users use the feature flag from #135035
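
The pre-run check and the fallback order suggested above (unassign, else reject, else incomplete with a reason) could look roughly like the following sketch. It is written in Python purely for illustration and is not openQA code: `WorkerInfo`, `gre_peers`, `check_cluster_assignment` and the `can_unassign`/`can_reject` switches are hypothetical names, and it assumes the worker can learn at assignment time which hosts the other jobs of the cluster are assigned to, which is itself one of the open questions above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    UNASSIGN = "unassign"
    REJECT = "reject"
    INCOMPLETE = "incomplete"


@dataclass(frozen=True)
class WorkerInfo:
    # Hypothetical view of a worker: its own host name and the hosts it
    # believes it has a working GRE tunnel to.
    host: str
    gre_peers: frozenset


def check_cluster_assignment(worker, cluster_hosts, can_unassign=True, can_reject=True):
    """Return (Decision, reason) for a multi-machine cluster assignment.

    cluster_hosts: hosts that the jobs of the parallel cluster are assigned to.
    """
    remote_hosts = set(cluster_hosts) - {worker.host}
    # A cluster running entirely on this host needs no across-host setup.
    if not remote_hosts:
        return Decision.ACCEPT, ""

    unreachable = sorted(remote_hosts - worker.gre_peers)
    if not unreachable:
        return Decision.ACCEPT, ""

    reason = (
        f"across-host multimachine setup requested but not available on "
        f"{worker.host} (no tunnel to: {', '.join(unreachable)}); "
        f"consider the feature flag from #135035 to limit the cluster to one host"
    )
    # Prefer the least disruptive feedback to the scheduler that is possible.
    if can_unassign:
        return Decision.UNASSIGN, reason
    if can_reject:
        return Decision.REJECT, reason
    return Decision.INCOMPLETE, reason


worker = WorkerInfo(host="w1", gre_peers=frozenset({"w2"}))
decision, reason = check_cluster_assignment(worker, ["w1", "w3"])
print(decision.value, "-", reason)
```

In this sketch a cluster spread over interconnected hosts is still accepted, which matches AC2, while a cluster that would span hosts without a tunnel is handed back before any job starts, which is the intent of AC1.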

Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)

Copied from openQA Project - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)

Copied to openQA Project - action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M (Resolved, mkittler, 2024-03-27)

#1

Updated by okurz 8 months ago

  • Copied from action #135035: Optionally restrict multimachine jobs to a single worker added
#2

Updated by okurz 8 months ago

  • Copied to action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M added
#3

Updated by okurz 6 months ago

  • Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
#4

Updated by okurz 6 months ago

  • Target version changed from future to Tools - Next

#160646 makes it necessary to give this a higher priority.

#5

Updated by okurz 5 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

To be done after #158146.

#6

Updated by okurz 4 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Tools - Next to future