action #121789 (closed, 100% done)
Multi-Machine tests lose ability to communicate
Description
Observation
This was already an issue yesterday and appeared again today, so it needs a structural fix to stop showing up. Yesterday, Fabian rebooted ow20 and things worked; today it stopped working again.
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-ovs-server@64bit fails in
ovs_server
Test suite description
Regression test for openvswitch-ipsec. Maintainer: Anna Minou
Reproducible
Fails since (at least) Build 20221209
Expected result
Last good: 20221208 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by dimstar about 2 years ago
ow19 and ow20 seem to be affected
Tests that ended up on ow4 passed
Updated by okurz about 2 years ago
- Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
Updated by okurz about 2 years ago
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category deleted (Bugs in existing tests)
- Target version set to Ready
Updated by livdywan about 2 years ago
Covered briefly in the daily. We'll see if Fabian can look into it, on account of having set this up last week; a response in "factory" is pending. If that doesn't happen, I'm prepared to look into it and see what I can figure out.
Updated by livdywan about 2 years ago
- Tags set to infra
Brought up in the infra daily. I assume we consider this infra.
Updated by favogt about 2 years ago
- Status changed from New to In Progress
There was a test running on ow19 and ow20, with the VMs able to ping each other in both directions and each VM being able to ping their host through 10.0.2.2.
The VM on ow20 was able to reach the outside (beyond the worker), but not the VM on ow19.
Using tcpdump, it was visible that the ICMP echo requests went from the tap device to br0 with the correct IP rewriting (by OVS), but did not end up on eth0.
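The bisection described above can be sketched as a couple of shell commands. This is illustrative only: the hop names (tap device, br0, eth0) are the ones reported in this ticket and will differ per worker.

```shell
# Enumerate the interfaces present on this worker first.
ls /sys/class/net
# Then, as root, watch ICMP echo requests at each hop along the path to see
# where they stop showing up (names below are from this ticket):
#   tcpdump -ni br0  'icmp[icmptype] == icmp-echo'
#   tcpdump -ni eth0 'icmp[icmptype] == icmp-echo'
```

Seeing the echo requests on br0 but not on eth0 is what pointed at forwarding being disabled between the two.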
It turns out that both net.ipv4.conf.eth0.forwarding and net.ipv4.conf.br1.forwarding were set to 0. Changing them back to 1 with sysctl -w restored networking completely.
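For reference, the forwarding state can be read without root from /proc/sys; a minimal sketch (the eth0/br1 names are the ones from this ticket and may differ elsewhere):

```shell
# Print the global forwarding switches; the per-interface ones found disabled
# in this ticket were net.ipv4.conf.eth0.forwarding and
# net.ipv4.conf.br1.forwarding.
for key in net/ipv4/ip_forward net/ipv4/conf/all/forwarding; do
    printf '%s = %s\n' "$key" "$(cat /proc/sys/$key)"
done
# Re-enabling them (root required), as done above:
#   sysctl -w net.ipv4.conf.eth0.forwarding=1
#   sysctl -w net.ipv4.conf.br1.forwarding=1
```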
I'm not sure what the cause is, but:
openqaworker19:~ # cat /etc/sysctl.d/70-yast.conf
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 0
I deleted that file now. Maybe it's fixed, let's see.
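A possible hardening step, assuming systemd's sysctl.d precedence (for the same key, the lexicographically later file wins): pin forwarding on with a higher-numbered drop-in instead of only deleting the YaST file, so a rewrite of 70-yast.conf cannot silently turn it off again. The file name below is hypothetical, not from this ticket.

```shell
# Writing to /tmp here for illustration; on a worker this would live in
# /etc/sysctl.d/ (e.g. 99-openqa-mm-forwarding.conf, name is an assumption).
conf=/tmp/99-openqa-mm-forwarding.conf
cat > "$conf" <<'EOF'
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
EOF
cat "$conf"
# Apply on the worker as root: sysctl --system
```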
Updated by favogt almost 2 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
[12:36] DimStar: Did https://progress.opensuse.org/issues/121789 happen again?
[12:36] fvogt: don't think I'd seen that popping up the last few days
Updated by okurz almost 2 years ago
@favogt great that you could fix it. I am just afraid that the next time problems occur we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before tests start failing?
Updated by favogt almost 2 years ago
okurz wrote:
@favogt great that you could fix it. I am just afraid that the next time problems occur we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before tests start failing?
Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
What I did was apply tcpdump to each interface along the path to figure out where the traffic goes missing.
Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
Updated by okurz almost 2 years ago
favogt wrote:
okurz wrote:
@favogt great that you could fix it. I am just afraid that the next time problems occur we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before tests start failing?
Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration
What I did was apply tcpdump to each interface along the path to figure out where the traffic goes missing.
ok, thx
Some more complete documentation on how MM networking with OVS works would be helpful not only for troubleshooting I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
yeah, I just don't think there is anyone else right now that feels more confident to write that up than you are :)
Updated by okurz almost 2 years ago
- Related to action #122299: openQA worker should fail with explicit error message if multi-machine test is triggered but requirements are not fulfilled added