Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, all jobs of a multi-machine test can be forced to run on the same worker.
The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.
The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.
- AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
- AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
- Have a look at https://github.com/Martchus/openQA/pull/new/dependency-pinning for how this could be enabled and documented.
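
To make AC1 and AC2 concrete, here is a minimal sketch of the two scheduling modes. It is not openQA's scheduler code: the function `assign_cluster`, the flag `one_host_only`, and the `free_slots` data layout are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Slot = Tuple[str, int]  # (worker host, worker instance number)


def assign_cluster(
    cluster_jobs: List[str],
    free_slots: Dict[str, List[int]],
    one_host_only: bool = False,  # hypothetical flag, not an openQA setting
) -> Optional[Dict[str, Slot]]:
    """Map each job of a parallel cluster to a free worker slot.

    Returns None when the cluster cannot be scheduled right now.
    """
    if one_host_only:
        # AC1: every job of the cluster must land on the same host, so a
        # single host needs at least len(cluster_jobs) free instances.
        for host in sorted(free_slots):
            instances = free_slots[host]
            if len(instances) >= len(cluster_jobs):
                return {
                    job: (host, instance)
                    for job, instance in zip(cluster_jobs, instances)
                }
        return None

    # AC2 (default): any free slot on any host may be used.
    pool = [(host, i) for host in sorted(free_slots) for i in free_slots[host]]
    if len(pool) < len(cluster_jobs):
        return None
    return dict(zip(cluster_jobs, pool))


if __name__ == "__main__":
    jobs = ["server", "client"]
    # Two free slots exist, but on different hosts.
    split = {"worker1": [1], "worker2": [1]}
    print(assign_cluster(jobs, split))                      # default mode
    print(assign_cluster(jobs, split, one_host_only=True))  # must wait
```

In the second call the cluster stays in the queue even though two slots are free, because they are on different hosts. That is the cost of the single-host mode: it only schedules once one host has enough free instances at the same time.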
apappas wrote in #note-4:
Target version set to future
Can we get either a concrete ETA or a rejection?
The ETA is: Certainly not within the next few days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have the capacity to work on that anytime soon.
Updated by asmorodskyi 6 months ago
I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. That dramatically increased the queue wait time for MM tests: with a queue mixing MM jobs and single jobs, it is hard to hit the moment when two worker instances on the same worker host are free. GRE bridges were introduced to resolve this problem. If we drop that now, we will get back to the old problem, so we need to make sure we address the old problem before switching to this mode.