action #45461

closed

[kernel] test fails in t11_vlan_ifdown_modify_one_config

Added by asmorodskyi over 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2018-12-21
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-x86_64-wicked_startandstop_sut@64bit fails in
t11_vlan_ifdown_modify_one_config

The SUT fails to ping REF.

Reproducible

Fails since (at least) Build 128.1

Expected result

Last good: 125.1 (or more recent)

Further details

Always latest result in this scenario: latest


Files

Multimachine in OpenQA.png (114 KB) Multimachine.png jlausuch, 2019-01-15 12:53
packet headers.png (175 KB) packets.png jlausuch, 2019-01-15 12:58
Actions #1

Updated by jlausuch over 5 years ago

After some troubleshooting, I've found out why the ping doesn't work in some environments but does work in others. The reason doesn't depend on the OS at all; it depends only on the host (hypervisor) configuration.

In Kimball, we have the following setup for the VMs that are created:

ovs-vsctl show
9ed7aec9-3b5b-4b89-a911-f4c07c8e16f5
    Bridge "br1"
        Port "br1"
            Interface "br1"
                type: internal
        Port "tap0"
            Interface "tap0"
        Port "tap1"
            Interface "tap1"
        Port "tap2"
            Interface "tap2"

In Fromm, we had the same setup, but with br0 instead of br1. The ping didn't work.

However, after I modified the interfaces and changed the bridge from br0 to br1, the ping worked.

It seems that we can't really use br0 as stated here [1] "os-autoinst-openvswitch.service uses br0 bridge by default. As it might be used by KVM, configure br1 instead."

I don't know exactly how os-autoinst uses br0, but the error looked like some kind of conflict between os-autoinst and the VMs trying to use the same bridge. That is only my guess; I don't know the mechanism.

What remains is to check how the OSD infra machine is set up, i.e. whether it uses br0 or br1.

[1] http://open.qa/docs/#_multi_machine_tests_setup
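
To compare workers quickly, the `ovs-vsctl show` output quoted above can be reduced to a bridge-to-tap mapping. The following is only a rough sketch: the function name `ports_by_bridge` is made up for illustration, and it assumes the output layout shown in this ticket (`Bridge`/`Port` lines, names optionally quoted).

```python
import re

# Sample taken from the `ovs-vsctl show` output in this ticket (Kimball).
SAMPLE = '''9ed7aec9-3b5b-4b89-a911-f4c07c8e16f5
    Bridge "br1"
        Port "br1"
            Interface "br1"
                type: internal
        Port "tap0"
            Interface "tap0"
        Port "tap1"
            Interface "tap1"
        Port "tap2"
            Interface "tap2"
'''

def ports_by_bridge(show_output):
    """Map each bridge name to the tap ports listed under it.

    Hypothetical helper: it only looks at `Bridge` and `Port` lines and
    ignores everything else (UUID, Interface, type).
    """
    bridges, current = {}, None
    for line in show_output.splitlines():
        m = re.match(r'\s*(Bridge|Port)\s+"?([^"\s]+)"?', line)
        if not m:
            continue
        kind, name = m.groups()
        if kind == "Bridge":
            current = bridges.setdefault(name, [])
        elif current is not None and name.startswith("tap"):
            current.append(name)
    return bridges

if __name__ == "__main__":
    print(ports_by_bridge(SAMPLE))
```

Running it on the output of each worker makes it easy to see whether the tap devices sit on br0 or br1.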

Actions #2

Updated by jlausuch over 5 years ago

In Fromm, when I checked the OpenFlow rules while the bridge was br0, there were more rules than in the setup with br1.

For the setup with br0:

ovs-ofctl dump-flows br0
cookie=0x0, duration=1833.656s, table=0, n_packets=4, n_bytes=168, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=1833.627s, table=0, n_packets=572, n_bytes=421431, priority=100,ip,dl_dst=16:41:d9:24:5d:41 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.598s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=16:41:d9:24:5d:41 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.683s, table=0, n_packets=567, n_bytes=3142743, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=1833.704s, table=0, n_packets=224, n_bytes=17984, priority=0 actions=NORMAL
cookie=0x0, duration=1739.315s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=1739.315s, table=1, n_packets=406, n_bytes=2446854, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=1735.286s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap1
cookie=0x0, duration=1735.286s, table=1, n_packets=157, n_bytes=695721, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap1

For the setup with br1, there is only 1 rule:

ovs-ofctl dump-flows br1
cookie=0x0, duration=1559995.167s, table=0, n_packets=27731, n_bytes=30804955, priority=0 actions=NORMAL
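
The difference between the two dumps is easier to see when the rules are counted per table and the hexadecimal `load:` values are translated into dotted IPv4 addresses. This is a minimal sketch, not part of os-autoinst: the helper names are invented here, and the regular expressions assume the `ovs-ofctl dump-flows` line format pasted above.

```python
import ipaddress
import re
from collections import Counter

# Four of the br0 flow entries from this ticket, copied verbatim.
SAMPLE = """\
cookie=0x0, duration=1833.683s, table=0, n_packets=567, n_bytes=3142743, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=1833.704s, table=0, n_packets=224, n_bytes=17984, priority=0 actions=NORMAL
cookie=0x0, duration=1739.315s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=1735.286s, table=1, n_packets=157, n_bytes=695721, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap1
"""

def rules_per_table(dump):
    """Count flow entries per table.

    Only the first `table=` on each line is used, because `learn(table=1,...)`
    inside the actions would otherwise be counted as well.
    """
    counts = Counter()
    for line in dump.splitlines():
        m = re.search(r"table=(\d+)", line)
        if m:
            counts[int(m.group(1))] += 1
    return dict(counts)

def loaded_ips(dump):
    """Return the IPv4 addresses that are loaded as hex constants into ARP/IP fields."""
    pattern = r"load:(0x[0-9a-fA-F]+)->NXM_OF_(?:ARP_[ST]PA|IP_(?:SRC|DST))\[\]"
    return sorted({str(ipaddress.IPv4Address(int(h, 16))) for h in re.findall(pattern, dump)})

if __name__ == "__main__":
    print(rules_per_table(SAMPLE))
    print(loaded_ips(SAMPLE))
```

For example, `0xa00020a` decodes to 10.0.2.10 and `0xa00020b` to 10.0.2.11, which are the addresses the table=1 rules rewrite packets to before sending them to the tap devices.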

Actions #3

Updated by jlausuch over 5 years ago

Proof run in Fromm using br1
http://fromm.arch.suse.de/tests/4626

Actions #4

Updated by okurz over 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: wicked_startandstop_sut
https://openqa.suse.de/tests/2345785

Actions #5

Updated by jlausuch over 5 years ago

After comparing the OSD machine with our openQA environments, it turned out that we actually had a misconfiguration (in Kimball and Fromm). After correcting the configuration and using br1 as the OVS bridge to which the TAP devices for the VMs are added, the test also fails in Kimball in the same way as in OSD (no ping).

These are the flows of br1 while the ping is executed in a loop:

cookie=0x0, duration=2235.692s, table=0, n_packets=11, n_bytes=462, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=2235.685s, table=0, n_packets=701, n_bytes=2127626, priority=100,ip,dl_dst=82:77:18:5b:00:48 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.679s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=82:77:18:5b:00:48 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.699s, table=0, n_packets=582, n_bytes=55852, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
cookie=0x0, duration=928.360s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap2
cookie=0x0, duration=928.359s, table=1, n_packets=238, n_bytes=21969, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap2
cookie=0x0, duration=795.939s, table=1, n_packets=3, n_bytes=126, priority=100,arp,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=795.939s, table=1, n_packets=39, n_bytes=3177, priority=100,ip,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=294, n_bytes=30244, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=6, n_bytes=252, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0

The n_packets field increases with each ping attempt in this line:

cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
Actions #6

Updated by jlausuch over 5 years ago

According to the documentation [1]

  • packets from tapX to br1 create additional rules in table=1
  • packets from br1 to tapX increase packet counts in table=1

[1] http://open.qa/docs/#_debugging_open_vswitch_configuration

So we see the packet count increasing in table 0, but not in table 1.
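
This kind of observation can be checked by taking two `ovs-ofctl dump-flows` snapshots while the ping loop runs and comparing the counters per table. The sketch below assumes the line format shown in this ticket. The helper names are invented, and the second snapshot is made-up illustration data, not a real capture.

```python
import re
from collections import defaultdict

FLOW_RE = re.compile(r"table=(\d+), n_packets=(\d+), n_bytes=\d+, (.*)")

def parse_counters(dump):
    """Map (table, match-and-actions text) to n_packets for every flow line."""
    counters = {}
    for line in dump.splitlines():
        m = FLOW_RE.search(line)
        if m:
            counters[(int(m.group(1)), m.group(3).strip())] = int(m.group(2))
    return counters

def delta_per_table(before, after):
    """Sum the n_packets increase between two snapshots, grouped by table."""
    old, new = parse_counters(before), parse_counters(after)
    deltas = defaultdict(int)
    for (table, rule), count in new.items():
        deltas[table] += count - old.get((table, rule), 0)
    return dict(deltas)

# First snapshot: two lines copied from the br1 dump in this ticket.
BEFORE = """\
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
cookie=0x0, duration=928.359s, table=1, n_packets=238, n_bytes=21969, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap2
"""

# Second snapshot: HYPOTHETICAL values standing in for "five pings later".
AFTER = """\
cookie=0x0, duration=2240.706s, table=0, n_packets=860, n_bytes=59600, priority=0 actions=NORMAL
cookie=0x0, duration=933.359s, table=1, n_packets=238, n_bytes=21969, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap2
"""

if __name__ == "__main__":
    print(delta_per_table(BEFORE, AFTER))
```

A non-zero delta in table 0 together with a zero delta in table 1 matches what is described here: the pings only hit the NORMAL rule.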

Actions #7

Updated by agraul over 5 years ago

  • Subject changed from test fails in t11_vlan_ifdown_modify_one_config to [kernel] test fails in t11_vlan_ifdown_modify_one_config

Updated by jlausuch over 5 years ago

After some investigation: os-autoinst handles multi-machine scenarios (two or more parallel jobs) by assigning an 802.1Q VLAN tag to the packets leaving the virtual machine. This tag is set by the os-autoinst-openvswitch script [1]. This way, openQA can run different jobs at the same time even if they use the same IPs, so the tag is what isolates the network between jobs.

Besides that, if the two VMs are running on different workers (machines), a GRE tunnel encapsulates the traffic from one worker to the other.
So the network flow looks something like in this picture.

The problem arises when the packet from the VM already carries a VLAN tag. The TAP device receives this packet and, since its tag differs from the one assigned to the port, the packet is dropped. So the communication does not work in this case with the current setup.

The solution is to use VLAN stacking (802.1ad, QinQ), which inserts another VLAN header without changing the original packet, keeping the original tag as part of the packet. It can even encapsulate multiple VLAN tags in the same packet. This picture describes better what we need to achieve.

The way to achieve this in Open vSwitch is to set the port option vlan_mode=dot1q-tunnel.

[1] https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch#L149
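
To make the tag stacking concrete, here is a small sketch at the byte level. It does not touch Open vSwitch at all; it only builds Ethernet headers with the standard TPID values (0x8100 for 802.1Q, 0x88a8 for 802.1ad). The MAC addresses are borrowed from the flow dumps in this ticket, while the VLAN IDs 42 and 7 and the function names are arbitrary examples.

```python
import struct

TPID_8021Q = 0x8100   # customer tag, as sent by the VM
TPID_8021AD = 0x88A8  # service (outer) tag used for stacking

def push_vlan(frame, vid, tpid):
    """Insert a VLAN tag right after the destination and source MACs (12 bytes)."""
    tci = vid & 0x0FFF  # priority and DEI bits are left at zero
    return frame[:12] + struct.pack("!HH", tpid, tci) + frame[12:]

def vlan_stack(frame):
    """Return the (tpid, vid) tags following the MAC addresses, outermost first."""
    tags, offset = [], 12
    while True:
        tpid, tci = struct.unpack_from("!HH", frame, offset)
        if tpid not in (TPID_8021Q, TPID_8021AD):
            return tags
        tags.append((tpid, tci & 0x0FFF))
        offset += 4

# Untagged IPv4 frame: dst MAC, src MAC, EtherType 0x0800, dummy payload.
untagged = (bytes.fromhex("525400120001") + bytes.fromhex("525400120002")
            + struct.pack("!H", 0x0800) + bytes(20))

# The VM sends a frame that already carries VLAN 42 (example value).
guest_frame = push_vlan(untagged, 42, TPID_8021Q)

# In tunnel mode, the job's own tag (7, example value) is pushed on top.
tunneled = push_vlan(guest_frame, 7, TPID_8021AD)

if __name__ == "__main__":
    print(vlan_stack(guest_frame))
    print(vlan_stack(tunneled))
```

Removing the four outer tag bytes from `tunneled` gives back `guest_frame` unchanged, which is the property the test needs: the VLAN configured inside the VM survives the per-job isolation tag.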

Actions #10

Updated by jlausuch over 4 years ago

  • Status changed from New to Resolved

This was already resolved; I forgot to close the ticket at the time.
