action #121789
MultiMachine tests lose ability to communicate
100%
Description
Observation
This was already an issue yesterday and appeared again today, so it needs a structural fix to not show up anymore.
Yesterday, Fabian rebooted ow20, and things worked. Today it stopped working again.
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-ovs-server@64bit fails in ovs_server
Test suite description
Regression test for openvswitch-ipsec. Maintainer: Anna Minou
Reproducible
Fails since (at least) Build 20221209
Expected result
Last good: 20221208 (or more recent)
Further details
Always latest result in this scenario: latest
Related issues
History
#2
Updated by okurz 3 months ago
- Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
#6
Updated by favogt 3 months ago
- Status changed from New to In Progress
There was a test running on ow19 and ow20, with the VMs able to ping each other in both directions and each VM able to ping its host through 10.0.2.2.
The VM on ow20 was able to reach the outside (beyond the worker), but the VM on ow19 was not.
Using tcpdump, it was visible that the ICMP echo requests went from the tap device to br0 with the correct IP rewriting (by OVS), but did not end up on eth0.
It turns out that both net.ipv4.conf.eth0.forwarding and net.ipv4.conf.br1.forwarding were set to 0. Changing them to 1 again with sysctl -w restored networking completely.
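A check like the one described above could be scripted so it does not depend on remembering which keys to query. This is a minimal sketch, not part of openQA: the function name check_forwarding and the PROC_ROOT override are made up for illustration, and the interface names passed in are assumed to match the worker (eth0 and br1 are the ones named in this comment).

```shell
#!/bin/sh
# Report the IPv4 forwarding flag for each given interface.
# PROC_ROOT is a hypothetical override so the function can be tried
# against a scratch directory instead of the live /proc/sys tree.
PROC_ROOT="${PROC_ROOT:-/proc/sys}"

check_forwarding() {
    rc=0
    for iface in "$@"; do
        f="$PROC_ROOT/net/ipv4/conf/$iface/forwarding"
        if [ ! -r "$f" ]; then
            # Interface missing or not readable: report and count as a failure.
            echo "$iface: unknown (no $f)"
            rc=1
            continue
        fi
        val=$(cat "$f")
        if [ "$val" = "1" ]; then
            echo "$iface: forwarding enabled"
        else
            echo "$iface: forwarding DISABLED"
            rc=1
        fi
    done
    # Non-zero if any interface was missing or had forwarding off.
    return $rc
}
```

Running check_forwarding eth0 br1 on a worker should flag the state described here. Note that sysctl -w only changes the running kernel, so the flag can be reset again by a reboot or by a configuration file being re-applied.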
I'm not sure what the cause is, but:
openqaworker19:~ # cat /etc/sysctl.d/70-yast.conf
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 0
I deleted that file now. Maybe it's fixed, let's see.
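To spot other configuration files that might switch forwarding off again, the sysctl configuration directories can be searched for forwarding keys set to 0. The following is only a sketch: the function name find_forwarding_overrides is hypothetical, and the default search paths are the usual sysctl.d locations, which is an assumption about how the workers are set up.

```shell
#!/bin/sh
# List sysctl config lines that set a forwarding key to 0.
# With no arguments, the usual sysctl.d locations are searched; which of
# them exist on a given worker is an assumption.
find_forwarding_overrides() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d
    fi
    # -r: recurse into directories, -s: stay quiet about missing paths,
    # -H/-n: print file name and line number, -E: extended regex.
    # Matches e.g. "net.ipv4.ip_forward = 0" and
    # "net.ipv6.conf.all.forwarding = 0", but not keys set to 1.
    grep -rsHnE '^[[:space:]]*net\.ipv[46]\.[^#]*forward[a-z_]*[[:space:]]*=[[:space:]]*0[[:space:]]*$' "$@"
}
```

Called with a directory argument, for example find_forwarding_overrides /etc/sysctl.d, it prints each matching line prefixed with file name and line number. With the 70-yast.conf content shown above, the first two lines would be reported and the disable_ipv6 line would not.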
#7
Updated by favogt 3 months ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
[12:36] DimStar: Did https://progress.opensuse.org/issues/121789 happen again?
[12:36] fvogt: don't think I'd seen that popping up the last few days
#9
Updated by okurz 3 months ago
favogt great that you could fix it. I am just afraid the next time on problems we will be in a similar situation. Do you have an idea what can be added to the documentation or even better to our software to clearly indicate what the problems are before we start failing tests?
#10
Updated by favogt 3 months ago
okurz wrote:
> favogt great that you could fix it. I am just afraid the next time on problems we will be in a similar situation. Do you have an idea what can be added to the documentation or even better to our software to clearly indicate what the problems are before we start failing tests?
Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
What I did was applying tcpdump to all interfaces along the path to figure out where it goes wrong.
Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
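The tcpdump approach mentioned above could be written down roughly as follows. This sketch only prints the commands to run, one per interface along the path, so that each can be started in its own terminal (as root) while a ping runs inside the VM. The function name tcpdump_commands and the tap device name tap0 are hypothetical; br1 and eth0 are names that appear earlier in this ticket, and the actual path on a given worker may differ.

```shell
#!/bin/sh
# Print one tcpdump command per interface along the packet path.
# Nothing is captured here; the output is a list of commands to run by hand.
tcpdump_commands() {
    for iface in "$@"; do
        # -n: no name resolution, -e: show link-level headers,
        # -i: capture on this interface; "icmp" is the capture filter.
        echo "tcpdump -n -e -i $iface icmp"
    done
}

# tap0 is a hypothetical tap device name; adjust the list to the worker.
tcpdump_commands tap0 br1 eth0
```

The first interface in the list on which the echo requests no longer show up marks the hop where packets are dropped.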
#11
Updated by okurz 3 months ago
favogt wrote:
> okurz wrote:
> > favogt great that you could fix it. I am just afraid the next time on problems we will be in a similar situation. Do you have an idea what can be added to the documentation or even better to our software to clearly indicate what the problems are before we start failing tests?
> Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
> What I did was applying tcpdump to all interfaces along the path to figure out where it goes wrong.
ok, thx
> Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
yeah, I just don't think there is anyone else right now that feels more confident to write that up than you are :)
#12
Updated by okurz 3 months ago
- Related to action #122299: openQA worker should fail with explicit error message if multi-machine test is triggered but requirements are not fulfilled added