action #121789
closed
MultiMachine tests lose ability to communicate
Added by dimstar about 2 years ago.
Updated almost 2 years ago.
Description
Observation
This was already an issue yesterday and appeared again today, so it needs a structural fix so that it does not show up anymore.
Yesterday, Fabian rebooted ow20, and things worked. Today it stopped working again.
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-ovs-server@64bit fails in ovs_server
Test suite description
Regression test for openvswitch-ipsec. Maintainer: Anna Minou
Reproducible
Fails since (at least) Build 20221209
Expected result
Last good: 20221208 (or more recent)
Further details
Always latest result in this scenario: latest
ow19 and ow20 seem to be affected
Tests that ended up on ow4 passed
- Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category deleted (Bugs in existing tests)
- Target version set to Ready
Covered briefly in the daily. We'll see if Fabian can look into it, since he set this up last week, pending a response in "factory". If that doesn't happen, I'm prepared to look into it and see what I can figure out.
Brought up in the infra daily. I assume we consider this infra.
- Status changed from New to In Progress
There was a test running on ow19 and ow20, with the VMs able to ping each other in both directions and each VM being able to ping their host through 10.0.2.2.
The VM on ow20 was able to reach the outside (beyond the worker), but the VM on ow19 was not.
Using tcpdump, it was visible that the ICMP echo requests went from the tap device to br0 with the correct IP rewriting (by OVS), but did not end up on eth0.
It turns out that both net.ipv4.conf.eth0.forwarding and net.ipv4.conf.br1.forwarding were set to 0. Changing them to 1 again with sysctl -w restored networking completely.
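For future checks, the forwarding flags mentioned above could be read directly from /proc/sys. The following is only a minimal sketch: the function name check_forwarding and the PROC_ROOT override are made up for illustration, and the interface names have to be adapted to the worker in question.

```shell
#!/bin/sh
# Sketch: report the IPv4 forwarding flag of each given interface.
# PROC_ROOT is a hypothetical override (useful for testing);
# it defaults to the real /proc/sys tree.
check_forwarding() {
    root="${PROC_ROOT:-/proc/sys}"
    rc=0
    for iface in "$@"; do
        f="$root/net/ipv4/conf/$iface/forwarding"
        if [ ! -r "$f" ]; then
            echo "$iface: unknown (cannot read $f)"
            rc=1
            continue
        fi
        val=$(cat "$f")
        echo "$iface: forwarding=$val"
        # Anything other than 1 counts as "forwarding disabled".
        [ "$val" = "1" ] || rc=1
    done
    return "$rc"
}

# Example call on a worker: check_forwarding eth0 br1
```

The function returns non-zero as soon as one interface has forwarding disabled or cannot be read, so it could be used in a simple health check before multi-machine jobs are scheduled.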
I'm not sure what the cause is, but:
openqaworker19:~ # cat /etc/sysctl.d/70-yast.conf
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 0
I deleted that file now. Maybe it's fixed, let's see.
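Whether that file was really the cause is not confirmed here. To spot similar candidates on other workers, sysctl configuration files could be searched for entries that set a forwarding key to 0. This is a rough sketch only; the function name find_forwarding_disablers and the default search paths are assumptions and may not cover every location used on a given system.

```shell
#!/bin/sh
# Sketch: list sysctl config entries that set a forwarding key to 0.
# With no arguments, a few common locations are searched (assumed defaults).
find_forwarding_disablers() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/sysctl.d/*.conf /usr/lib/sysctl.d/*.conf /etc/sysctl.conf
    fi
    # -H prints the file name, -n the line number, -E enables extended regex.
    grep -HnE '^[[:space:]]*net\.ipv[46]\.(ip_forward|conf\.[^.]+\.forwarding)[[:space:]]*=[[:space:]]*0[[:space:]]*$' "$@" 2>/dev/null
}

# Example call: find_forwarding_disablers /etc/sysctl.d/70-yast.conf
```

For the file shown above this would report the ip_forward and the IPv6 forwarding line, but not the disable_ipv6 line.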
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
@favogt great that you could fix it. I am just afraid the next time on problems we will be in a similar situation. Do you have an idea what can be added to the documentation or even better to our software to clearly indicate what the problems are before we start failing tests?
okurz wrote:
@favogt great that you could fix it. I am just afraid the next time on problems we will be in a similar situation. Do you have an idea what can be added to the documentation or even better to our software to clearly indicate what the problems are before we start failing tests?
Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
What I did was applying tcpdump to all interfaces along the path to figure out where it goes wrong.
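As a rough illustration of that approach, a small helper could print one tcpdump command per interface on the path, to be run in separate terminals while the VM pings an outside host. The function name icmp_capture_cmds is made up for this sketch, and tap0 is only a placeholder for whatever tap device the job actually uses.

```shell
#!/bin/sh
# Sketch: print one tcpdump command per interface along the packet path.
# -n disables name resolution, -i selects the interface,
# -c 5 stops after five packets, "icmp" filters for ICMP traffic.
icmp_capture_cmds() {
    for iface in "$@"; do
        echo "tcpdump -n -i $iface -c 5 icmp"
    done
}

# Example call (tap0 is a placeholder): icmp_capture_cmds tap0 br0 eth0
```

The first interface on which the echo requests no longer show up marks the hop where the packets are dropped.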
Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
favogt wrote:
Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
What I did was applying tcpdump to all interfaces along the path to figure out where it goes wrong.
ok, thx
Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
yeah, I just don't think there is anyone else right now that feels more confident to write that up than you are :)
- Related to action #122299: openQA worker should fail with explicit error message if multi-machine test is triggered but requirements are not fulfilled added