action #135035
closed
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Optionally restrict multimachine jobs to a single worker
Added by apappas over 1 year ago.
Updated 8 months ago.
Category: Feature requests
Description
Motivation
Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, all jobs of a multi-machine test can be forced to run on the same worker.
The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.
The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.
Acceptance Criteria
- AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
- AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
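The two acceptance criteria can be illustrated with a minimal sketch. This is not openQA's scheduler code: the function `assign_cluster`, the flag `one_host_only`, and the `(host, slot)` representation of free worker slots are all hypothetical names chosen for illustration only.

```python
from collections import defaultdict


def assign_cluster(cluster_jobs, free_slots, one_host_only=False):
    """Map each job of a parallel cluster to a free worker slot.

    cluster_jobs: list of job names (hypothetical representation).
    free_slots: list of (host, slot_id) tuples that are currently idle.
    one_host_only: hypothetical flag standing in for the opt-in
        restriction described in AC1.
    Returns a dict {job: (host, slot_id)} or None if the cluster
    cannot be scheduled right now.
    """
    if len(free_slots) < len(cluster_jobs):
        return None

    if not one_host_only:
        # AC2: default behaviour, jobs may land on different hosts.
        return dict(zip(cluster_jobs, free_slots))

    # AC1: group free slots by host and pick the first host that can
    # take the whole cluster.
    by_host = defaultdict(list)
    for host, slot_id in free_slots:
        by_host[host].append((host, slot_id))
    for slots in by_host.values():
        if len(slots) >= len(cluster_jobs):
            return dict(zip(cluster_jobs, slots))

    # No single host has enough free slots; the cluster has to wait.
    return None
```

With free slots `[("w1", 1), ("w2", 1), ("w2", 2)]`, a two-job cluster is spread over `w1` and `w2` by default, but lands entirely on `w2` when `one_host_only=True`. A three-job cluster cannot be placed at all in the restricted mode, which is the situation in which jobs would have to wait in the queue for a single host to free up enough slots.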
Suggestions
- Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
- Description updated (diff)
- Category set to Feature requests
- Target version set to future
Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.
The workaround for the workaround is to pin to a specific machine.
I do not understand.
We will pin to a specific machine as a bridge until this is implemented.
Target version set to future
Can we get either a concrete ETA or a rejection?
apappas wrote in #note-4:
Target version set to future
Can we get either a concrete ETA or a rejection?
The ETA is: certainly not within the next few days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have the capacity to work on it anytime soon.
I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. That dramatically increased the queue wait time for MM tests: in a queue that mixes MM jobs and single jobs, it is hard to catch the moment when two worker instances on the same worker are free. GRE bridges were introduced to resolve this problem. If we drop them now, we will get back to the old problem, so we need to make sure that we address it before switching to this mode.
- Target version changed from future to Tools - Next
- Parent task set to #111929
- Description updated (diff)
- Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added
- Description updated (diff)
- Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
- Description updated (diff)
- Subject changed from [tools]Pin multimachine jobs to a single worker to Optionally restrict multimachine jobs to a single worker
- Target version changed from Tools - Next to Ready
- Status changed from New to In Progress
- Due date set to 2024-04-02
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
- Copied to action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
- Status changed from Feedback to Resolved
- Due date deleted (2024-04-02)
- Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added