action #45461: [kernel] test fails in t11_vlan_ifdown_modify_one_config - openQA Tests - openSUSE Project Management Tool

action #45461

closed

[kernel] test fails in t11_vlan_ifdown_modify_one_config

Added by asmorodskyi over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2018-12-21
Due date:
% Done:
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-x86_64-wicked_startandstop_sut@64bit fails in
t11_vlan_ifdown_modify_one_config

failing to ping REF from SUT

Reproducible

Fails since (at least) Build 128.1

Expected result

Last good: 125.1 (or more recent)

Further details

Always latest result in this scenario: latest

Files

Multimachine in OpenQA.png (114 KB)	Multimachine.png	jlausuch, 2019-01-15 12:53
packet headers.png (175 KB)	packets.png	jlausuch, 2019-01-15 12:58

Updated by jlausuch over 5 years ago

After some troubleshooting, I found out why the ping works in some environments and not in others. The cause does not depend on the OS at all; it depends only on the host (hypervisor) configuration.

In Kimball, we have the following setup for the VMs that are created:

ovs-vsctl show
9ed7aec9-3b5b-4b89-a911-f4c07c8e16f5
    Bridge "br1"
        Port "br1"
            Interface "br1"
                type: internal
        Port "tap0"
            Interface "tap0"
        Port "tap1"
            Interface "tap1"
        Port "tap2"
            Interface "tap2"

In Fromm, we had the same setup but with br0 instead of br1, and the ping didn't work.

However, after I modified the interfaces and changed the bridge from br0 to br1, the ping worked.

It seems that we can't really use br0, as stated here [1]: "os-autoinst-openvswitch.service uses br0 bridge by default. As it might be used by KVM, configure br1 instead."

I don't really know how os-autoinst uses br0, but the error looked like some kind of conflict between os-autoinst and the VMs trying to use the same bridge. That is only my guess; I don't know the mechanism.

What remains is to check how the OSD infra machine is set up: whether it uses br0 or br1.

[1] http://open.qa/docs/#_multi_machine_tests_setup

Updated by jlausuch over 5 years ago

In Fromm, when I checked the OpenFlow rules while the bridge was br0, there were more rules than in the setup with br1.

For the setup with br0:

ovs-ofctl dump-flows br0
cookie=0x0, duration=1833.656s, table=0, n_packets=4, n_bytes=168, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=1833.627s, table=0, n_packets=572, n_bytes=421431, priority=100,ip,dl_dst=16:41:d9:24:5d:41 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.598s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=16:41:d9:24:5d:41 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.683s, table=0, n_packets=567, n_bytes=3142743, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=1833.704s, table=0, n_packets=224, n_bytes=17984, priority=0 actions=NORMAL
cookie=0x0, duration=1739.315s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=1739.315s, table=1, n_packets=406, n_bytes=2446854, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=1735.286s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap1
cookie=0x0, duration=1735.286s, table=1, n_packets=157, n_bytes=695721, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap1

For the setup with br1, there is only 1 rule:

ovs-ofctl dump-flows br1
cookie=0x0, duration=1559995.167s, table=0, n_packets=27731, n_bytes=30804955, priority=0 actions=NORMAL
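
To compare the two rule sets without reading them by eye, the dump-flows output can be summarised per table. The following is only a minimal sketch: parse_flows and rules_per_table are hypothetical helper names, and SAMPLE holds three entries copied from the br0 dump rather than the full rule set. It assumes the text format shown above, one flow entry per line.

```python
import re
from collections import Counter

# Three entries copied from the br0 dump in this ticket (not the full rule set).
SAMPLE = """\
cookie=0x0, duration=1833.683s, table=0, n_packets=567, n_bytes=3142743, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=1833.704s, table=0, n_packets=224, n_bytes=17984, priority=0 actions=NORMAL
cookie=0x0, duration=1739.315s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap0
"""


def _int_field(head, name):
    """Return the integer value of `name=<digits>` in the match part, or None."""
    found = re.search(rf"\b{name}=(\d+)", head)
    return int(found.group(1)) if found else None


def parse_flows(dump):
    """Return one dict per flow entry found in `ovs-ofctl dump-flows` text."""
    flows = []
    for line in dump.splitlines():
        line = line.strip()
        if not line.startswith("cookie="):
            continue  # skip anything that is not a flow entry
        # Split first: the actions part can itself contain "table=" (learn rules).
        head, _, actions = line.partition(" actions=")
        flows.append({
            "table": _int_field(head, "table"),
            "n_packets": _int_field(head, "n_packets"),
            "priority": _int_field(head, "priority"),
            "actions": actions,
        })
    return flows


def rules_per_table(flows):
    """Count how many flow entries each table holds."""
    return Counter(flow["table"] for flow in flows)


if __name__ == "__main__":
    for table, count in sorted(rules_per_table(parse_flows(SAMPLE)).items()):
        print(f"table={table}: {count} rule(s)")
```

Feeding it the full br0 and br1 dumps would show directly that br1 carried only the single table=0 NORMAL rule.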

Updated by jlausuch over 5 years ago

Proof run in Fromm using br1
http://fromm.arch.suse.de/tests/4626

Updated by okurz over 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: wicked_startandstop_sut
https://openqa.suse.de/tests/2345785

Updated by jlausuch over 5 years ago

After finding out the differences between the OSD machine and our openQA environments, it turned out that we actually had a misconfiguration (in Kimball and Fromm). After correcting the configuration and using br1 as the OVS bridge to which the TAP devices for the VMs are added, the test also fails in Kimball, in the same way as in OSD (no ping).

These are the flows of br1 while the ping is executed in a loop:

cookie=0x0, duration=2235.692s, table=0, n_packets=11, n_bytes=462, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=2235.685s, table=0, n_packets=701, n_bytes=2127626, priority=100,ip,dl_dst=82:77:18:5b:00:48 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.679s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=82:77:18:5b:00:48 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.699s, table=0, n_packets=582, n_bytes=55852, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
cookie=0x0, duration=928.360s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap2
cookie=0x0, duration=928.359s, table=1, n_packets=238, n_bytes=21969, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap2
cookie=0x0, duration=795.939s, table=1, n_packets=3, n_bytes=126, priority=100,arp,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=795.939s, table=1, n_packets=39, n_bytes=3177, priority=100,ip,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=294, n_bytes=30244, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=6, n_bytes=252, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0

The n_packets field increases with each ping attempt in this line:

cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL

Updated by jlausuch over 5 years ago

According to the documentation [1]:

packets from tapX to br1 create additional rules in table=1
packets from br1 to tapX increase packet counts in table=1

[1] http://open.qa/docs/#_debugging_open_vswitch_configuration

So we see packet counts increasing in table=0 but not in table=1.
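
Spotting which counters move is easier with two snapshots taken a few seconds apart while the ping loop runs. The sketch below is an illustration only: packet_counts and growing_rules are hypothetical helpers, BEFORE holds two entries copied from the br1 dump above, and the AFTER values are made up to show the mechanics.

```python
import re

# BEFORE: two entries copied from the br1 dump in this ticket.
BEFORE = """\
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
cookie=0x0, duration=924.028s, table=1, n_packets=294, n_bytes=30244, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
"""

# AFTER: hypothetical second snapshot; these counter values are invented.
AFTER = """\
cookie=0x0, duration=2240.706s, table=0, n_packets=860, n_bytes=59460, priority=0 actions=NORMAL
cookie=0x0, duration=929.028s, table=1, n_packets=294, n_bytes=30244, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
"""


def packet_counts(dump):
    """Map 'table=N <match>' to n_packets for every flow entry in the dump."""
    counts = {}
    for line in dump.splitlines():
        line = line.strip()
        if not line.startswith("cookie="):
            continue  # not a flow entry
        head, _, _actions = line.partition(" actions=")
        table = re.search(r"\btable=(\d+)", head).group(1)
        n_packets = int(re.search(r"\bn_packets=(\d+)", head).group(1))
        # The match (priority plus fields) is the last space-separated token.
        match = head.rsplit(" ", 1)[-1]
        counts[f"table={table} {match}"] = n_packets
    return counts


def growing_rules(before, after):
    """Return the rules whose n_packets increased, with the increase."""
    old, new = packet_counts(before), packet_counts(after)
    return {
        rule: new[rule] - old[rule]
        for rule in new
        if rule in old and new[rule] > old[rule]
    }


if __name__ == "__main__":
    for rule, delta in growing_rules(BEFORE, AFTER).items():
        print(f"+{delta} packets: {rule}")
```

Rules are keyed by table and match rather than by the whole line, because the duration field changes between snapshots.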

Updated by agraul over 5 years ago

Subject changed from test fails in t11_vlan_ifdown_modify_one_config to [kernel] test fails in t11_vlan_ifdown_modify_one_config

Updated by cfconrad over 5 years ago

https://github.com/os-autoinst/os-autoinst/pull/1087

Updated by jlausuch over 5 years ago

File Multimachine in OpenQA.png added
File packet headers.png added

After some investigation: the way os-autoinst handles multi-machine scenarios (2 or more parallel jobs) is to assign a VLAN tag (802.1Q) to the packets egressing from the virtual machine. This tag is added by the os-autoinst-openvswitch script [1]. This way, openQA can run different jobs at the same time even if they use the same IPs; it is how the network is isolated between jobs.

Besides that, if the 2 VMs are running on different workers (machines), a GRE tunnel encapsulates the traffic from one worker to the other.
So the network flow would look something like the attached picture (Multimachine in OpenQA.png).

The problem arises when the original packet already carries a VLAN tag from the VM. The TAP device receives this packet and, since the tag differs from the one assigned to the port, the packet is dropped. So communication in this case does not work with the current setup.

The solution is VLAN stacking (Q-in-Q, IEEE 802.1ad), which inserts another VLAN header without changing the original packet, keeping the original tag as part of the packet. It can even encapsulate multiple VLAN tags in the same packet. The attached picture (packet headers.png) shows what we need to achieve.

The way to achieve this in Open vSwitch is to set the port option vlan_mode=dot1q-tunnel.

[1] https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch#L149
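
To make the stacked VLAN headers concrete, here is a small sketch that builds the Ethernet header bytes for a singly and a doubly tagged frame. Everything in it is illustrative: vlan_tag and ethernet_header are hypothetical helpers, the VLAN IDs 42 (set by the VM) and 7 (added by the switch for the job) are invented, and it assumes the outer tag uses the 802.1ad TPID 0x88a8, which depends on the switch configuration. The MAC addresses are the ones seen in the flow dumps above.

```python
import struct

TPID_8021Q = 0x8100   # TPID of a regular (customer) VLAN tag
TPID_8021AD = 0x88A8  # TPID of a service tag; assumed for the outer tag here
ETHERTYPE_IPV4 = 0x0800

DST = bytes.fromhex("525400120001")  # 52:54:00:12:00:01
SRC = bytes.fromhex("525400120002")  # 52:54:00:12:00:02


def vlan_tag(tpid, vid, pcp=0, dei=0):
    """Build one 4-byte VLAN tag: 16-bit TPID followed by the 16-bit TCI."""
    tci = (pcp << 13) | (dei << 12) | (vid & 0x0FFF)
    return struct.pack("!HH", tpid, tci)


def ethernet_header(dst, src, tags, ethertype):
    """Destination MAC, source MAC, zero or more VLAN tags, then the EtherType."""
    return dst + src + b"".join(tags) + struct.pack("!H", ethertype)


# Hypothetical VLAN IDs, for illustration only.
inner = vlan_tag(TPID_8021Q, 42)   # tag set by the VM on its VLAN interface
outer = vlan_tag(TPID_8021AD, 7)   # tag added by the switch for the job

untagged = ethernet_header(DST, SRC, [], ETHERTYPE_IPV4)
single = ethernet_header(DST, SRC, [inner], ETHERTYPE_IPV4)
double = ethernet_header(DST, SRC, [outer, inner], ETHERTYPE_IPV4)

if __name__ == "__main__":
    for name, header in (("untagged", untagged), ("single", single), ("double", double)):
        print(f"{name:9s} {len(header):2d} bytes  {header.hex()}")
```

The point to notice is that the VM's own tag sits unchanged behind the outer one in the doubly tagged header, so it can be handed back to the peer VM once the outer tag is stripped.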

Updated by jlausuch over 4 years ago

Status changed from New to Resolved

This was already resolved; I forgot to close the ticket at the time.
