action #135773

Updated by okurz 10 months ago

## Observation 
 See #134282-1 
 > There is something wrong with the multi-machine network when tests are run across different workers. If a multi-machine job is forced to run on the same worker, it is fine. 

 > There are fails in core group: https://openqa.suse.de/tests/11843205#next_previous 
 Kernel group: https://openqa.suse.de/tests/11846943#next_previous 
 HPC: https://openqa.suse.de/tests/11845897#next_previous 

 The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5 

 ## Acceptance criteria 
 * **AC1:** The "ovs-client+ovs-server" test scenario passes consistently when running on multiple OSD workers with "tap" class 
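
 To make "passes consistently" in AC1 measurable, the fail ratio of a batch of jobs can be reported together with a confidence interval. The following is a minimal sketch, not an existing openQA tool: the function names are hypothetical, the set of result strings treated as passing is an assumption, and the result lists have to be collected separately (for example from the job overview of each run).

```python
import math

# Assumption: these openQA result strings count as a pass for this scenario.
PASSING = {"passed", "softfailed"}


def fail_ratio(results):
    """Return the fraction of job results that are not in PASSING."""
    if not results:
        raise ValueError("need at least one job result")
    failures = sum(1 for r in results if r not in PASSING)
    return failures / len(results)


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for the true fail ratio (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from real jobs.
    single_host = ["passed"] * 10
    multi_host = ["passed"] * 7 + ["failed"] * 3
    for name, results in (("single host", single_host), ("multiple hosts", multi_host)):
        failures = sum(1 for r in results if r not in PASSING)
        low, high = wilson_interval(failures, len(results))
        print(f"{name}: fail ratio {fail_ratio(results):.2f}, "
              f"95% interval [{low:.2f}, {high:.2f}]")
```

 With small batches the interval is wide, so comparing the single-host reference against the multi-host runs only becomes meaningful once the two intervals stop overlapping.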

 ## Suggestions 
 * Check the current fail ratio of the scenario using https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation when running on 
   * a *single* physical host (as reference) 
   * multiple hosts 
 * Thoroughly read #134282-3 
 * Read https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.cookbook.mtu-mss.html and check if that is applicable for us 
 * For easier reproduction and investigation, trigger openQA multi-machine clusters with PAUSE_AT, see https://github.com/os-autoinst/os-autoinst/blob/master/doc/backend_vars.asciidoc, e.g. pausing after the systems have booted and potentially configured their network 
 * Check for MTU-size-related problems, e.g. with `ping` using big packet sizes and explicitly selecting the bridge or tap device 
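
 The MTU check suggested above can be sketched as follows. This is a rough helper under stated assumptions, not part of openQA: it computes the largest ICMP payload that fits a given MTU without fragmentation (IPv4 header 20 bytes, ICMP header 8 bytes) and builds a Linux iputils `ping` command with the "don't fragment" flag (`-M do`), the payload size (`-s`) and an explicit device (`-I`). The function names, the target address and the device name are hypothetical, and the tunnel overhead is a parameter because the actual encapsulation overhead between workers has to be verified on the hosts.

```python
import shlex

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu, tunnel_overhead=0):
    """Largest ICMP echo payload that fits into `mtu` without fragmentation."""
    payload = mtu - tunnel_overhead - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU too small for the given overhead")
    return payload


def ping_command(target, mtu, interface=None, tunnel_overhead=0, count=3):
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    size = icmp_payload_for_mtu(mtu, tunnel_overhead)
    cmd = ["ping", "-M", "do", "-c", str(count), "-s", str(size)]
    if interface:
        cmd += ["-I", interface]  # send via an explicit bridge or tap device
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    # Hypothetical target and device name; replace with real values.
    print(shlex.join(ping_command("192.0.2.1", 1500, interface="tap0")))
    # Same check assuming 38 bytes of encapsulation overhead (an assumption).
    print(shlex.join(ping_command("192.0.2.1", 1500, interface="tap0",
                                  tunnel_overhead=38)))
```

 If the full-size ping fails across workers while the reduced-size one succeeds, that points towards an MTU problem on the path between the hosts.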

 ## Out of scope 
 * Anything that already fails when the multi-machine cluster runs on a single physical host 
 * #135035 "Pin multimachine jobs to a single worker" 
 * Any test other than "ovs-client+ovs-server" 
 * Minimizing the reproducer, e.g. by skipping test modules in openQA -> #135818 

 ## Workaround 
 Pin the multi-machine jobs to a single physical machine
