action #111908
coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Multimachine failures between multiple physical workers
Description
Observation
Multi-machine (MM) tests show "random unexpected" failures that appear to be caused by some issue between multiple workers.
Below is a list of the support_server jobs of MM HA/SAP jobs that failed in the last two weeks.
I restarted these jobs on a single openQA worker and they did not fail.
I see the same on my local HA/SAP instance: when I use one worker, there are nearly no "random unexpected" failures.
When I use two physical workers, the rate of "random unexpected" failures increases.
https://openqa.suse.de/tests/8804890#dependencies
https://openqa.suse.de/tests/8804876#dependencies
https://openqa.suse.de/tests/8804944#dependencies
https://openqa.suse.de/tests/8796653#dependencies
https://openqa.suse.de/tests/8806626#dependencies
https://openqa.suse.de/tests/8813734#dependencies
https://openqa.suse.de/tests/8819834#dependencies
https://openqa.suse.de/tests/8818172#dependencies
https://openqa.suse.de/tests/8818165#dependencies
https://openqa.suse.de/tests/8825849#dependencies
https://openqa.suse.de/tests/8842164#dependencies
https://openqa.suse.de/tests/8844261#dependencies
https://openqa.suse.de/tests/8855774#dependencies
https://openqa.suse.de/tests/8856411#dependencies
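To check whether a failed cluster really spanned more than one physical worker, the assigned workers of its jobs can be grouped by host. The following is only a minimal sketch: it assumes worker instances are named `host:instance` (for example `openqaworker8:3`), and the job names and worker assignments in the example are made up for illustration, not taken from the jobs listed above.

```python
from collections import defaultdict


def hosts_used(assigned_workers):
    """Group job names by the physical host of their assigned worker.

    `assigned_workers` maps a job name to a worker instance string,
    assumed to look like "host:instance" (hypothetical input format).
    """
    by_host = defaultdict(list)
    for job, worker in assigned_workers.items():
        # Everything before the first ":" is treated as the host name.
        host = worker.split(":", 1)[0]
        by_host[host].append(job)
    return dict(by_host)


def spans_multiple_hosts(assigned_workers):
    """Return True if the cluster ran on more than one physical host."""
    return len(hosts_used(assigned_workers)) > 1


if __name__ == "__main__":
    # Hypothetical cluster: two jobs on one host, one job on another.
    cluster = {
        "support_server": "openqaworker8:3",
        "node01": "openqaworker8:7",
        "node02": "openqaworker3:12",
    }
    print(hosts_used(cluster))
    print(spans_multiple_hosts(cluster))
```

Running this over the failed and the passing clusters would show whether the failures really correlate with clusters that were split across hosts.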
Steps to reproduce
The failures are random. I could reproduce them on a local instance with multiple physical workers.
Problem
I assume it is a network/openvswitch/GRE issue between the worker hosts.
Workaround
Run all jobs of the cluster on one physical worker by adding that worker's host name to WORKER_CLASS, e.g. WORKER_CLASS=qemu_x86_64,tap,openqaworker8
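As an illustration of the workaround, the pinned value can be derived from an existing WORKER_CLASS by appending the host-specific class. This is a small sketch with a hypothetical helper name (`pin_to_host`); it assumes the worker host advertises a class named after itself, as `openqaworker8` does in the example above.

```python
def pin_to_host(worker_class, host):
    """Append a host-specific class to a WORKER_CLASS value.

    The existing classes are kept in order; the host class is added
    only if it is not already present.
    """
    classes = [c.strip() for c in worker_class.split(",") if c.strip()]
    if host not in classes:
        classes.append(host)
    return ",".join(classes)


if __name__ == "__main__":
    print("WORKER_CLASS=" + pin_to_host("qemu_x86_64,tap", "openqaworker8"))
```

The same value has to be set on every job of the MM cluster, otherwise the scheduler may still place the jobs on different hosts.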