action #111908
coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Multimachine failures between multiple physical workers
Description
Observation
There are "random unexpected" multi-machine (MM) failures that appear to be caused by an issue between multiple physical workers.
Below is a list of the support_server jobs of MM HA/SAP jobs that failed in the last two weeks.
I restarted these jobs on a single openQA worker and they did not fail.
I have the same experience with a local HA/SAP instance: when I use one worker, there are nearly no "random unexpected" failures.
When I use two physical workers, the rate of "random unexpected" failures increases.
https://openqa.suse.de/tests/8804890#dependencies
https://openqa.suse.de/tests/8804876#dependencies
https://openqa.suse.de/tests/8804944#dependencies
https://openqa.suse.de/tests/8796653#dependencies
https://openqa.suse.de/tests/8806626#dependencies
https://openqa.suse.de/tests/8813734#dependencies
https://openqa.suse.de/tests/8819834#dependencies
https://openqa.suse.de/tests/8818172#dependencies
https://openqa.suse.de/tests/8818165#dependencies
https://openqa.suse.de/tests/8825849#dependencies
https://openqa.suse.de/tests/8842164#dependencies
https://openqa.suse.de/tests/8844261#dependencies
https://openqa.suse.de/tests/8855774#dependencies
https://openqa.suse.de/tests/8856411#dependencies
Steps to reproduce
The failures are random. I could reproduce them on a local instance with multiple physical workers.
Problem
I assume it is a network/openvswitch/GRE issue between the servers.
Workaround
Run the jobs on one physical worker via WORKER_CLASS, e.g. WORKER_CLASS=qemu_x86_64,tap,openqaworker8
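As a minimal sketch of the workaround, the setting can be assembled by a small helper so that the host-specific class is appended only once. The function name `pin_worker_class` is hypothetical and not part of openQA; the class names are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): build a WORKER_CLASS setting
# that pins a job to one physical worker.
#   $1: comma-separated worker classes the job already requires
#   $2: host-specific worker class, e.g. openqaworker8
pin_worker_class() {
    classes=$1
    host_class=$2
    case ",$classes," in
        # The host class is already in the list: keep the list unchanged.
        *",$host_class,"*) printf 'WORKER_CLASS=%s\n' "$classes" ;;
        # Otherwise append the host class.
        *) printf 'WORKER_CLASS=%s,%s\n' "$classes" "$host_class" ;;
    esac
}

pin_worker_class qemu_x86_64,tap openqaworker8
```

The printed setting could then be passed as a trailing KEY=value argument when re-triggering a job, for example `openqa-clone-job --within-instance https://openqa.suse.de 8804890 "$(pin_worker_class qemu_x86_64,tap openqaworker8)"`. Whether the override reaches every job of the parallel cluster depends on the clone options used and is not covered by this sketch.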
Updated by okurz over 2 years ago
- Target version set to Ready
- Parent task set to #103962
Hi dzedro, could you help us and extend the ticket description according to the template https://progress.opensuse.org/projects/openqav3/wiki/#Defects, especially the "steps to reproduce"? For example, the openqa-clone-job commands or openqa-cli calls that you might have used to trigger the tests.
Updated by okurz over 2 years ago
- Project changed from openQA Infrastructure to openQA Project
- Target version changed from Ready to future
@dzedro thanks for the detailed extension of the ticket. I agree with the workaround. For now I suggest following it, as we don't have the capacity to look into the specific problem. All affected tests should be configured according to the mentioned workaround.
Updated by livdywan about 1 year ago
- Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Updated by livdywan about 1 year ago
Re-reading http://open.qa/docs/#_gre_tunnels just now:
As long as the SUT has access to external network, there should be a non-zero packet count in the forward chain between the br1 and external interface.
sudo salt -C 'worker*' cmd.run 'iptables --list --verbose | grep FORWARD'
worker29.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker39.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker40.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker30.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker37.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker38.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm1.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker-arm2.oqa.prg2.suse.org:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker3.oqa.suse.de:
Chain FORWARD (policy ACCEPT 34524 packets, 40M bytes)
worker2.oqa.suse.de:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
worker5.oqa.suse.de:
Chain FORWARD (policy ACCEPT 7461 packets, 3820K bytes)
worker8.oqa.suse.de:
Chain FORWARD (policy ACCEPT 8591 packets, 12M bytes)
worker10.oqa.suse.de:
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
To me it reads like GRE tunnels won't work on machines showing 0 packets. Maybe this is something that could be validated as part of deployments in salt?
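One way such a validation could look, as a rough sketch only: a filter that reads the text output of the salt command above and prints the workers whose FORWARD policy counter is zero. The function name `zero_forward_workers` is hypothetical, and the parsing assumes salt's default text output, with one "hostname:" line followed by the matching chain line. A zero counter is only a hint, since the host may simply not have forwarded any SUT traffic since the counters were last reset.

```shell
#!/bin/sh
# Hypothetical filter: list minions whose FORWARD chain policy counter is 0.
# Reads the text output of `salt ... cmd.run 'iptables --list --verbose | grep FORWARD'`
# on stdin and prints one hostname per line.
zero_forward_workers() {
    awk '
        # A minion header looks like "worker29.oqa.prg2.suse.org:"
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # In "Chain FORWARD (policy ACCEPT N packets, M bytes)" the packet
        # counter is the field right before "packets,"
        /Chain FORWARD/ {
            for (i = 1; i < NF; i++)
                if ($(i + 1) == "packets," && $i == 0) print host
        }
    '
}
```

Usage, assuming the same targeting as above: `sudo salt -C 'worker*' cmd.run 'iptables --list --verbose | grep FORWARD' | zero_forward_workers`. Applied to the output pasted above, this would list all prg2 workers plus worker2.oqa.suse.de and worker10.oqa.suse.de.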
Updated by okurz 3 months ago
- Category set to Feature requests
- Status changed from New to Resolved
- Assignee set to okurz
- Target version changed from future to Ready
With #112001 and other related tickets we were able to improve the situation, prevent the original issues, and apply mitigations. We are also able to quickly work around similar problems in the future, e.g. by running each multi-machine test on a single host in case of problems.