action #135035
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Optionally restrict multimachine jobs to a single worker
Description
Motivation
Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found: forcing all jobs of a multi-machine test to run on the same worker.
The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.
The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.
Acceptance Criteria
- AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
- AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
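The two criteria can be read as a single assignment rule with a switch. The following is only a minimal sketch of that reading, not the openQA scheduler: the function `assign_cluster`, the `one_host_only` flag and the `free_slots` mapping (worker host name to number of free worker instances) are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional


def assign_cluster(
    jobs: List[str],
    free_slots: Dict[str, int],
    one_host_only: bool = False,
) -> Optional[Dict[str, str]]:
    """Return a job -> host mapping, or None if the cluster cannot start yet.

    Hypothetical helper: `free_slots` maps a worker host name to the number
    of currently free worker instances on that host.
    """
    if one_host_only:
        # AC1: every job of the cluster must land on the same worker host,
        # so a single host needs enough free instances for the whole cluster.
        for host, free in sorted(free_slots.items()):
            if free >= len(jobs):
                return {job: host for job in jobs}
        return None

    # AC2 (default): jobs may be spread over several hosts.
    assignment: Dict[str, str] = {}
    remaining = list(jobs)
    for host, free in sorted(free_slots.items()):
        take, remaining = remaining[:free], remaining[free:]
        for job in take:
            assignment[job] = host
    return assignment if not remaining else None
```

With one free instance on each of two hosts, a two-job cluster can be assigned in the default mode, while the single-host mode has to wait until one host has two free instances.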
Suggestions
- Have a look at https://github.com/Martchus/openQA/pull/new/dependency-pinning for how this could be enabled and documented.
Updated by apappas about 1 year ago
- Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by okurz about 1 year ago
- Category set to Feature requests
- Target version set to future
Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.
Updated by apappas about 1 year ago
The workaround for the workaround is to pin to a specific machine.
I do not understand.
We will pin to a specific machine as a bridge until this is implemented.
Target version set to future
Can we get either a concrete ETA or a rejection?
Updated by okurz about 1 year ago
apappas wrote in #note-4:
Target version set to future
Can we get either a concrete ETA or a rejection?
The ETA is: Certainly not within the next days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have capacity to work on that anytime soon.
Updated by asmorodskyi about 1 year ago
I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. This dramatically increased the queue wait time for MM tests: in a mixed queue of MM jobs and single jobs, it is hard to hit the condition where two worker instances on the same worker host are free at the same time. GRE bridges were introduced to resolve this problem. If we drop them now, we will be back at the old problem, so we need to make sure we address it before switching to this mode.
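To make the queueing concern concrete, here is a small sketch with made-up data. `first_start_tick` and the snapshot format are hypothetical and only illustrate the argument: each snapshot is the number of free worker instances per host at one point in time, and the function returns the first point at which a cluster of the given size could start.

```python
from typing import Dict, List, Optional


def first_start_tick(
    snapshots: List[Dict[str, int]],
    cluster_size: int,
    one_host_only: bool,
) -> Optional[int]:
    """Return the index of the first snapshot in which the cluster can start."""
    for tick, free_slots in enumerate(snapshots):
        if one_host_only:
            # All jobs need free instances on one and the same host.
            can_start = any(free >= cluster_size for free in free_slots.values())
        else:
            # Jobs may be spread across hosts, so only the total matters.
            can_start = sum(free_slots.values()) >= cluster_size
        if can_start:
            return tick
    return None


# Made-up example: single jobs keep taking instances on both hosts.
snapshots = [
    {"worker1": 1, "worker2": 1},
    {"worker1": 0, "worker2": 1},
    {"worker1": 2, "worker2": 0},
]
print(first_start_tick(snapshots, 2, one_host_only=False))
print(first_start_tick(snapshots, 2, one_host_only=True))
```

In this made-up sequence the cross-host mode can start right away, while the single-host mode has to wait until two instances happen to be free on the same host.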
Updated by okurz 11 months ago
- Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added
Updated by okurz 11 months ago
- Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
Updated by openqa_review 8 months ago
- Due date set to 2024-04-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 8 months ago
- Status changed from In Progress to Feedback
PR: https://github.com/os-autoinst/openQA/pull/5536
The PR is ready from my side and should be good enough for all the clusters/worker classes we have in production. I'm only waiting for reviews.
Updated by okurz 8 months ago
- Copied to action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Updated by okurz 8 months ago
- Related to coordination #157144: [epic] Groups of worker classes: Regions, locations, etc. added
Updated by mkittler 8 months ago
- Status changed from Feedback to Resolved
The PR was merged yesterday and it fulfills the ACs. That's not the end of the story (see https://github.com/os-autoinst/openQA/pull/5536#issuecomment-2022848509) but I would resolve this ticket now considering we have the follow-up tickets #158146 and #158143.
Updated by okurz 7 months ago
- Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added