action #135773
closed
[tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M
Description
Observation
See #134282-1
There is something wrong with the multimachine network when tests are run across different workers. If a multimachine job is forced to run on the same worker, it is fine.
There are failures in the core group: https://openqa.suse.de/tests/11843205#next_previous
Kernel group: https://openqa.suse.de/tests/11846943#next_previous
HPC: https://openqa.suse.de/tests/11845897#next_previous
The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5
Acceptance criteria
- AC1: The "ovs-client+ovs-server" test scenario passes consistently when running on multiple OSD workers with "tap" class
Suggestions
- Check for the current fail ratio of the scenario using https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation when running on
  - a single physical host (as reference)
  - multiple hosts
- Thoroughly read #134282-3
- Read https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.cookbook.mtu-mss.html and check if that is applicable for us
- For easier reproduction and investigation trigger openQA multi-machine clusters with PAUSE_AT, see https://github.com/os-autoinst/os-autoinst/blob/master/doc/backend_vars.asciidoc, e.g. pausing after the systems have booted and potentially configured their network
- Check for MTU size related problems, e.g. with "ping" using big packet sizes and explicit selection of the bridge or tap devices
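The MTU check from the suggestions above can be sketched as a few shell commands. This is only a sketch assuming a standard 1500-byte Ethernet MTU and Linux iputils ping; the bridge device name and peer address are placeholders to adapt:

```shell
# Assume a standard Ethernet MTU; adjust if the bridge uses a different one.
MTU=1500
# Largest ICMP payload that fits: MTU - 20 bytes IPv4 header - 8 bytes ICMP header
PAYLOAD=$((MTU - 28))
echo "probing with $PAYLOAD-byte payloads"
# -M do sets the Don't Fragment bit so an MTU mismatch fails loudly instead
# of being masked by fragmentation; -I selects the bridge/tap device
# explicitly. Device name and peer address below are placeholders:
# ping -c 3 -M do -s "$PAYLOAD" -I br1 <peer-address>
```

If the ping with the full-size payload fails while a smaller `-s` value succeeds, the path between the two workers most likely has a smaller MTU than the guests assume.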
Out of scope
- Anything that already fails when the multi-machine cluster runs on a single physical host
- #135035 "Pin multimachine jobs to a single worker"
- Any other test than "ovs-client+server"
- Try to minimize the reproducer, e.g. skip test modules in openQA -> #135818
Workaround
Pin to a single physical machine
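In openQA such pinning is expressed through the jobs' worker class; a hypothetical sketch, with the host-specific class name mirroring the classes that appear later in this ticket:

```shell
# Hypothetical: give both halves of the multi-machine cluster a
# WORKER_CLASS that only one physical host provides, so the scheduler
# cannot spread them across workers:
WORKER_CLASS=qemu_x86_64,tap_poo135407,worker31
```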
Updated by okurz over 1 year ago
- Copied from action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by okurz over 1 year ago
- Subject changed from [tools] many multi-machine test failures when tests are run across different workers to [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan about 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5
No recent failures; however, there's one job specifically scheduled on worker31, which is not currently in production.
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 1 year ago
livdywan wrote in #note-3:
The scenario is https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ovs-client&version=15-SP5
Jobs are looking good currently e.g. on worker30+29 or worker30+37 (including on worker31 #135407#note-15).
There's one job scheduled specifically for worker31 https://openqa.suse.de/tests/12268287 which won't work because tap was taken out, hence I'm cloning that w/o the worker class.
Edit: https://openqa.suse.de/tests/12268107
Updated by livdywan about 1 year ago
Running ovs-client/server on production mm workers/31 respectively to test stability. Second build has the worker selection reversed in case it makes a difference:
- https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=poo135773_prod_spare
- https://openqa.suse.de/tests/overview?build=poo135773_prod_spare_flip&distri=sle&version=15-SP5
Running this was a bit tedious. What I ended up doing was using --export-command to prepare a -X POST job command, and then setting especially these variables:
'BUILD=poo135773_prod_spare_flip'
"TEST:12253260=ovs-server-$i"
"TEST:12253438=ovs-client-$i"
'WORKER_CLASS:12253260=qemu_x86_64,tap_poo135407,worker31'
'WORKER_CLASS:12253438=qemu_x86_64,tap'
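Putting the pieces together, the shape of the edited command looks roughly like the following. This is a sketch only: the exact invocation that --export-command prints (host, route, extra settings) is an assumption here, while the job IDs and variable overrides are the ones from above; the VARIABLE:jobid=value form applies a setting to one job of the cluster only:

```shell
# Sketch: --export-command prints the API call instead of posting it,
# so per-job overrides can be edited in before running it.
i=1
cmd="openqa-cli api --host https://openqa.suse.de -X POST jobs \
 BUILD=poo135773_prod_spare_flip \
 TEST:12253260=ovs-server-$i \
 TEST:12253438=ovs-client-$i \
 WORKER_CLASS:12253260=qemu_x86_64,tap_poo135407,worker31 \
 WORKER_CLASS:12253438=qemu_x86_64,tap"
echo "$cmd"
```

Looping over `i` then produces the numbered ovs-server-N/ovs-client-N pairs used for the stability statistics.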
Updated by livdywan about 1 year ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
I'm scheduling another batch (same group, higher indices). Let's see if jobs continue to be fine.
Updated by livdywan about 1 year ago
livdywan wrote in #note-7:
I'm scheduling another batch (same group, higher indices). Let's see if jobs continue to be fine.
Well, it turns out some of the repos were removed, rendering the batch pointless since the jobs all fail to install from those repos, and this is at best checked when initially cloning, which I was not doing due to my use of --export-command.
Updated by okurz about 1 year ago
- https://openqa.suse.de/tests/overview?distri=sle&version=15-SP5&build=poo135773_prod_spare 43/43=100% passed
- https://openqa.suse.de/tests/overview?build=poo135773_prod_spare_flip&distri=sle&version=15-SP5 100/100=100% passed (001..100)
so we are good regarding worker31. Yes, triggering those tests and tweaking them is ugly. What I did now was to use a SLE15-SP6 base, which is less likely to need changes to repo settings and might be a good variant to test as well. What I did:
openqa-clone-job --export-command --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12207584
then copy-paste the command into the command line, delete all of the variables "BUILD", "_GROUP_ID", "TEST" and "WORKER_CLASS", and replace them accordingly with
'TEST:12207583=ovs-client-poo135773' 'TEST:12207584=ovs-server-poo135773' 'WORKER_CLASS:12207583=qemu_x86_64,tap' 'WORKER_CLASS:12207584=worker31' _GROUP=0 BUILD=poo135773_prod_spare_flip
forming the complete command.
This triggered https://openqa.suse.de/tests/12343491
@livdywan please ensure that the workaround of "pin to individual machines" is removed everywhere and reference in #134282 how to verify other workers.
Updated by livdywan about 1 year ago
@livdywan please ensure that the workaround of "pin to individual machines" is removed everywhere and reference in #134282 how to verify other workers.
There's no remaining pinning for this ticket in pillars/workerconf.sls (there are for #136130 and #135407 which is expected) and none in job groups (note that GitLab can't search this repo, it needs to be cloned with subrepos).
Updated by livdywan about 1 year ago
- Status changed from Feedback to Resolved