action #158143
opencoordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available
0%
Description
Motivation¶
Multi-machine jobs have been failing since 20230814, because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found in forcing the complete multi-machine tests to run in the same worker. In #135035 we added a feature flag to limit jobs to a single physical host which can be used for debugging or as temporary workaround or if the network design prevents multiple hosts to be interconnected by GRE tunnels. But by default when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts which might not be properly interconnected then there is no measure preventing workers to pick up such clusters causing hard to investigate openQA job failures which we should try to prevent. We should make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available and optionally inform about the possibility to use the "limit to one host only" feature flag.
Acceptance Criteria¶
- AC1: openQA workers with "tap" class but not configured for across-host multimachine setup do not fail openQA jobs due to being spread over multiple hosts
- AC2: By default jobs of a multi-machine parallel cluster can still be scheduled covering multiple different hosts
Suggestions¶
- Look into what was done in #135035 but for the central openQA scheduler
- Investigate if a worker knows about other workers that it would need to communicate with in a multi-machine cluster job, possibly during the "assignment" step
- Implement a pre-run check, possibly during the "assignment" step, where the worker would check if pre-requisites for across-host multimachine testing are fulfilled if the test cluster would need that, and fail early
- Ensure that such early failure is fed back to the openQA scheduler, e.g. by unassigning the job, possibly with an explicit message visible by admins somewhere?
- If not possible to unassign then somehow "reject" jobs or as last resort "incomplete" a job with an explicit "reason" which is still better than actually starting an openQA job and then causing fails
- Optionally in the message/reason returned suggest to the admin/users to use the feature flag from #135035
Updated by okurz 8 months ago
- Copied from action #135035: Optionally restrict multimachine jobs to a single worker added
Updated by okurz 8 months ago
- Copied to action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M added
Updated by okurz 6 months ago
- Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added