action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available - openQA Project (public) - openSUSE Project Management Tool

action #158143

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available

Added by okurz about 1 year ago. Updated 11 months ago.

Status:

New

Priority:

Low

Assignee:

Category:

Feature requests

Target version:

QA (public) - future

Start date:

Due date:

% Done:

Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 20230814 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, the complete multi-machine test cluster can be forced to run on the same worker. In #135035 we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or if the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts which might not be properly interconnected, nothing prevents workers from picking up such clusters. This causes hard-to-investigate openQA job failures which we should try to prevent. We should make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available and optionally inform about the possibility to use the "limit to one host only" feature flag.

Acceptance Criteria

AC1: openQA workers with "tap" class but not configured for across-host multimachine setup do not fail openQA jobs due to being spread over multiple hosts
AC2: By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts

Suggestions

Look into what was done in #135035 but for the central openQA scheduler
Investigate if a worker knows about other workers that it would need to communicate with in a multi-machine cluster job, possibly during the "assignment" step
Implement a pre-run check, possibly during the "assignment" step, where the worker checks whether the prerequisites for across-host multimachine testing are fulfilled if the test cluster needs that, and fails early otherwise
Ensure that such an early failure is fed back to the openQA scheduler, e.g. by unassigning the job, possibly with an explicit message visible to admins somewhere
If it is not possible to unassign, then somehow "reject" jobs or as a last resort "incomplete" a job with an explicit "reason", which is still better than actually starting an openQA job and then causing failures
Optionally in the message/reason returned suggest to the admin/users to use the feature flag from #135035
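
The decision logic suggested above could look roughly like the following. This is only a minimal sketch in Python for illustration and not openQA code: `precheck`, `Action`, `Decision`, `across_host_ready`, `can_unassign` and `can_reject` are all hypothetical names, and how a worker would actually learn about the other hosts of its cluster is exactly the open question from the suggestions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Action(Enum):
    ACCEPT = "accept"
    UNASSIGN = "unassign"      # preferred: hand the job back to the scheduler
    REJECT = "reject"          # fallback if unassigning is not possible
    INCOMPLETE = "incomplete"  # last resort, with an explicit reason


@dataclass
class Decision:
    action: Action
    reason: Optional[str] = None


# Hypothetical wording for the hint towards the feature flag from #135035
HINT = 'consider using the "limit to one host only" feature flag from #135035'


def precheck(cluster_hosts: Iterable[str], across_host_ready: bool,
             can_unassign: bool = True, can_reject: bool = True) -> Decision:
    """Decide during assignment whether a worker should start its job.

    cluster_hosts: hosts on which jobs of the parallel cluster are assigned
                   (assumes the worker can find this out somehow)
    across_host_ready: whether this host is set up for across-host
                       multimachine tests, e.g. has working GRE tunnels
    """
    needs_across_host = len(set(cluster_hosts)) > 1
    if not needs_across_host or across_host_ready:
        return Decision(Action.ACCEPT)

    reason = ("across-host multimachine setup requested but not available "
              "on this host; " + HINT)
    # Fall back step by step: unassign, then reject, then incomplete
    if can_unassign:
        return Decision(Action.UNASSIGN, reason)
    if can_reject:
        return Decision(Action.REJECT, reason)
    return Decision(Action.INCOMPLETE, reason)
```

For example, `precheck({"hostA", "hostB"}, across_host_ready=False)` yields an `UNASSIGN` decision whose reason mentions #135035, while a cluster living on a single host is always accepted, so AC2 is not affected for properly configured hosts.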

Related issues 3 (0 open — 3 closed)

Updated by okurz about 1 year ago

Copied from action #135035: Optionally restrict multimachine jobs to a single worker added

Updated by okurz about 1 year ago

Copied to action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M added

Updated by okurz about 1 year ago

Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
