action #135773
Updated by okurz over 1 year ago
## Observation
See #134282-1
> There is something wrong with the multimachine network when tests are run across different workers. If a multimachine job is forced to run on the same worker, it is fine.
> There are failures in the core group: https://openqa.suse.de/tests/11843205#next_previous
Kernel group: https://openqa.suse.de/tests/11846943#next_previous
HPC: https://openqa.suse.de/tests/11845897#next_previous
The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5
## Acceptance criteria
* **AC1:** The "ovs-client+ovs-server" test scenario passes consistently when running on multiple OSD workers with "tap" class
## Suggestions
* Check for the current fail ratio of the scenario using https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation when running on
  * a *single* physical host (as reference)
  * multiple hosts
* Thoroughly read #134282-3
* Read https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.cookbook.mtu-mss.html and check if that is applicable for us
* For easier reproduction and investigation, trigger openQA multi-machine clusters with PAUSE_AT, see https://github.com/os-autoinst/os-autoinst/blob/master/doc/backend_vars.asciidoc, e.g. pausing after the systems have booted and possibly configured their network
* Check for MTU size related problems, e.g. with `ping` using big packet sizes and explicit selections of bridge or tap devices
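
The MTU check suggested above could be sketched as follows. This is a minimal, non-authoritative sketch assuming Linux workers with iputils `ping`, where `-M do` forbids fragmentation, `-s` sets the ICMP payload size and `-I` selects the outgoing interface. The helper names, the peer address `10.0.2.101` and the device name `br1` are hypothetical placeholders, not taken from this ticket. For IPv4 the largest ICMP payload that fits into one packet is the MTU minus 28 bytes (20 bytes IP header plus 8 bytes ICMP header), so a path with an MTU of 1500 should pass payloads up to 1472 bytes.

```python
import subprocess
from typing import Callable, List, Optional

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one IPv4 packet of size `mtu`."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def build_ping_cmd(host: str, payload: int, iface: Optional[str] = None,
                   count: int = 3) -> List[str]:
    """Build an iputils ping command that must not be fragmented (-M do)."""
    cmd = ["ping", "-c", str(count), "-W", "2", "-M", "do", "-s", str(payload)]
    if iface:
        # Explicitly select the bridge or tap device to send from.
        cmd += ["-I", iface]
    cmd.append(host)
    return cmd


def ping_ok(host: str, payload: int, iface: Optional[str] = None) -> bool:
    """Return True if ping exits successfully for the given payload size."""
    result = subprocess.run(build_ping_cmd(host, payload, iface),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_passing_payload(probe: Callable[[int], bool],
                            low: int = 0, high: int = 1472) -> Optional[int]:
    """Binary-search the largest payload size for which `probe` succeeds.

    Assumes that if a size passes, all smaller sizes pass as well.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Inside a paused SUT or on a worker, a call such as `largest_passing_payload(lambda size: ping_ok("10.0.2.101", size, iface="br1"))` would report the largest unfragmented payload towards the (hypothetical) peer. A result that is lower across hosts than on a single host would point towards an MTU problem on the path between the workers.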
## Out of scope
* Anything that already fails when the multi-machine cluster runs on a single physical host
* #135035 "Pin multimachine jobs to a single worker"
* Any test other than "ovs-client+server"
* Minimizing the reproducer, e.g. by skipping test modules in openQA -> #135818
## Workaround
Pin the multi-machine cluster to a single physical machine