action #121789

closed

MultiMachine tests lose ability to communicate

Added by dimstar about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2022-12-10
Due date:
% Done: 100%
Estimated time:
Tags:

Description

Observation

This was already an issue yesterday and appeared again today, so this needs a structural fix so that it does not show up anymore.

Yesterday, Fabian rebooted ow20, and things worked. Today it stopped working again.

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-ovs-server@64bit fails in ovs_server

Test suite description

Regression test for openvswitch-ipsec. Maintainer: Anna Minou

Reproducible

Fails since (at least) Build 20221209

Expected result

Last good: 20221208 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 2 (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #115418: Setup ow19+20 to be able to run MM tests size:M (Resolved, favogt, 2022-08-17)

Related to openQA Project (public) - action #122299: openQA worker should fail with explicit error message if multi-machine test is triggered but requirements are not fulfilled (New)

Actions #1

Updated by dimstar about 2 years ago

ow19 and ow20 seem to be affected.

Tests that ended up on ow4 passed.

Actions #2

Updated by okurz about 2 years ago

  • Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
Actions #3

Updated by okurz about 2 years ago

  • Project changed from openQA Tests (public) to openQA Infrastructure (public)
  • Category deleted (Bugs in existing tests)
  • Target version set to Ready
Actions #4

Updated by livdywan about 2 years ago

Covered briefly in the daily. We'll see if Fabian can look into it, on account of having set this up last week, pending a response in "factory". If that doesn't happen, I'm prepared to look into it and see what I can figure out.

Actions #5

Updated by livdywan about 2 years ago

  • Tags set to infra

Brought up in the infra daily. I assume we consider this infra.

Actions #6

Updated by favogt about 2 years ago

  • Status changed from New to In Progress

There was a test running on ow19 and ow20, with the VMs able to ping each other in both directions and each VM able to ping its host through 10.0.2.2.
The VM on ow20 was able to reach the outside (beyond the worker), but the VM on ow19 was not.
Using tcpdump, it was visible that the ICMP echo requests went from the tap device to br0 with the correct IP rewriting (by OVS), but did not end up on eth0.
It turns out that both net.ipv4.conf.eth0.forwarding and net.ipv4.conf.br1.forwarding were set to 0. Changing them back to 1 with sysctl -w restored networking completely.
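
For reference, roughly the commands involved (a sketch; eth0 and br1 are the interface names on these workers, other hosts may differ):

# check the forwarding state on the interfaces along the path
sysctl net.ipv4.conf.eth0.forwarding net.ipv4.conf.br1.forwarding
# re-enable forwarding at runtime (not persistent across reboots)
sysctl -w net.ipv4.conf.eth0.forwarding=1
sysctl -w net.ipv4.conf.br1.forwarding=1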

I'm not sure what the cause is, but:

openqaworker19:~ # cat /etc/sysctl.d/70-yast.conf
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.disable_ipv6 = 0

I deleted that file now. Maybe it's fixed, let's see.
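
To double-check which configuration still sets these keys after changes like this, something along these lines should work (a sketch):

# which drop-in still sets a forwarding key?
grep -r forwarding /etc/sysctl.d/
# reapply all sysctl configuration files
sysctl --system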

Actions #7

Updated by favogt almost 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

[12:36] DimStar: Did https://progress.opensuse.org/issues/121789 happen again?
[12:36] fvogt: don't think I'd seen that popping up the last few days

Actions #8

Updated by favogt almost 2 years ago

  • Assignee set to favogt
Actions #9

Updated by okurz almost 2 years ago

@favogt great that you could fix it. I am just afraid that the next time problems come up we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before we start failing tests?

Actions #10

Updated by favogt almost 2 years ago

okurz wrote:

@favogt great that you could fix it. I am just afraid that the next time problems come up we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before we start failing tests?

Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration

What I did was apply tcpdump on all interfaces along the path to figure out where things go wrong.
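
Roughly like this (a sketch; tap0, br1 and eth0 stand in for the actual interface names, which differ per worker and job):

tcpdump -ni tap0 icmp   # do the echo requests leave the VM's tap device?
tcpdump -ni br1 icmp    # do they show up on the bridge, with the rewritten source IP?
tcpdump -ni eth0 icmp   # are they actually forwarded out of the physical uplink?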

Some more complete documentation on how MM networking with OVS works would be helpful, and not only for troubleshooting, I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.
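
Until that documentation exists, a few commands expose most of that configuration (a sketch; br1 as bridge name and iptables-based masquerading are assumptions about this particular setup):

ovs-vsctl show                       # bridges, ports and GRE tunnels towards the other workers
ovs-ofctl dump-flows br1             # OpenFlow rules, including the IP rewriting
iptables -t nat -L POSTROUTING -nv   # masquerading of the MM network towards the uplink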

Actions #11

Updated by okurz almost 2 years ago

favogt wrote:

okurz wrote:

@favogt great that you could fix it. I am just afraid that the next time problems come up we will be in a similar situation. Do you have an idea what could be added to the documentation, or even better to our software, to clearly indicate what the problems are before we start failing tests?

Not sure. The networking setup is fairly complex and I don't really understand all parts either. There's already a section about OVS debugging in the documentation which is somewhat helpful: http://open.qa/docs/#_debugging_open_vswitch_configuration

What I did was apply tcpdump on all interfaces along the path to figure out where things go wrong.

ok, thx

Some more complete documentation on how MM networking with OVS works would be helpful, and not only for troubleshooting, I'd say. The main missing part is how OVS is configured (IP rewriting) and how it plays together with VLANs, GRE tunnels and masquerading.

yeah, I just don't think there is anyone else right now who feels more confident writing that up than you do :)

Actions #12

Updated by okurz almost 2 years ago

  • Related to action #122299: openQA worker should fail with explicit error message if multi-machine test is triggered but requirements are not fulfilled added