action #45461
[kernel] test fails in t11_vlan_ifdown_modify_one_config (closed)
Description
Observation
openQA test in scenario sle-15-SP1-Installer-DVD-x86_64-wicked_startandstop_sut@64bit fails in t11_vlan_ifdown_modify_one_config, failing to ping REF from SUT.
Reproducible
Fails since (at least) Build 128.1
Expected result
Last good: 125.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by jlausuch over 5 years ago
After some troubleshooting, I've found out why the ping works in some environments and not in others. The reason doesn't depend on the OS at all; it depends only on the configuration of the host (hypervisor).
In Kimball, we have the following setup for the VMs that are created:
ovs-vsctl show
9ed7aec9-3b5b-4b89-a911-f4c07c8e16f5
    Bridge "br1"
        Port "br1"
            Interface "br1"
                type: internal
        Port "tap0"
            Interface "tap0"
        Port "tap1"
            Interface "tap1"
        Port "tap2"
            Interface "tap2"
In Fromm, we had the same setup, but with br0 instead of br1, and the ping didn't work.
However, when I changed the bridge used by the interfaces from br0 to br1, the ping worked.
It seems that we can't really use br0, as stated here [1]: "os-autoinst-openvswitch.service uses br0 bridge by default. As it might be used by KVM, configure br1 instead."
I don't know how os-autoinst uses br0, but the error looked like some kind of conflict between os-autoinst and the VMs trying to use the same bridge. That is only my guess; I don't know the mechanism.
What remains is to check how the OSD infra machine is set up, i.e. whether it uses br0 or br1.
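A quick way to check which bridge a worker is configured for, as a minimal sketch. It assumes the bridge is selected by an OS_AUTOINST_USE_BRIDGE line in /etc/sysconfig/os-autoinst-openvswitch and that br0 applies when the variable is not set, in line with the quote above. The helper name configured_bridge is made up for illustration.

```shell
# Print the bridge os-autoinst-openvswitch would use, given a sysconfig file.
# Assumption: the file holds a line like OS_AUTOINST_USE_BRIDGE=br1;
# without such a line the service default (br0) is reported.
configured_bridge() {
    conf="${1:-/etc/sysconfig/os-autoinst-openvswitch}"
    bridge=""
    if [ -r "$conf" ]; then
        # Take the last assignment and strip optional quotes.
        bridge=$(sed -n 's/^OS_AUTOINST_USE_BRIDGE=//p' "$conf" | tail -n 1 | tr -d "\"'")
    fi
    echo "${bridge:-br0}"
}
```

On a worker, running configured_bridge and comparing the result with the bridge shown by ovs-vsctl show should tell whether the TAP devices sit on the bridge the service manages.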
Updated by jlausuch over 5 years ago
In Fromm, when I checked the OpenFlow rules while the bridge was br0, there were more rules than in the setup with br1.
For the setup with br0:
ovs-ofctl dump-flows br0
cookie=0x0, duration=1833.656s, table=0, n_packets=4, n_bytes=168, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=1833.627s, table=0, n_packets=572, n_bytes=421431, priority=100,ip,dl_dst=16:41:d9:24:5d:41 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.598s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=16:41:d9:24:5d:41 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=1833.683s, table=0, n_packets=567, n_bytes=3142743, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=1833.704s, table=0, n_packets=224, n_bytes=17984, priority=0 actions=NORMAL
cookie=0x0, duration=1739.315s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=1739.315s, table=1, n_packets=406, n_bytes=2446854, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=1735.286s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap1
cookie=0x0, duration=1735.286s, table=1, n_packets=157, n_bytes=695721, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:02 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap1
For the setup with br1, there is only 1 rule:
ovs-ofctl dump-flows br1
cookie=0x0, duration=1559995.167s, table=0, n_packets=27731, n_bytes=30804955, priority=0 actions=NORMAL
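The load:0x... values in the table=1 rules of the br0 dump above are IPv4 addresses written as 32-bit hex numbers, which makes them hard to read at a glance. A small sketch for decoding them (the helper name hex_to_ip is made up for illustration):

```shell
# Convert a 32-bit hex value, as shown in "load:0x..." actions, to dotted-quad.
hex_to_ip() {
    v=$(( $1 ))
    printf '%d.%d.%d.%d\n' \
        $(( (v >> 24) & 255 )) $(( (v >> 16) & 255 )) \
        $(( (v >> 8) & 255 ))  $(( v & 255 ))
}
```

For example, hex_to_ip 0xa00020a gives 10.0.2.10 and hex_to_ip 0xa00020b gives 10.0.2.11, so the two learned rules rewrite the destination to those two addresses before sending the packet out on tap0 and tap1.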
Updated by jlausuch over 5 years ago
Proof run in Fromm using br1
http://fromm.arch.suse.de/tests/4626
Updated by okurz over 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: wicked_startandstop_sut
https://openqa.suse.de/tests/2345785
Updated by jlausuch over 5 years ago
After finding out the differences between the OSD machine and our openQA environments, it turned out that we had a misconfiguration in Kimball and Fromm. After correcting the configuration and using br1 as the OVS bridge to which the TAP devices of the VMs are added, the test fails in Kimball in the same way as in OSD (no ping).
These are the flows of br1 while the ping is executed in a loop:
cookie=0x0, duration=2235.692s, table=0, n_packets=11, n_bytes=462, priority=100,arp,arp_tpa=10.0.2.2 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x806,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],output:NXM_OF_IN_PORT[]),load:0xa010000->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[0..15]->NXM_OF_ARP_SPA[0..15],LOCAL
cookie=0x0, duration=2235.685s, table=0, n_packets=701, n_bytes=2127626, priority=100,ip,dl_dst=82:77:18:5b:00:48 actions=learn(table=1,priority=100,in_port=LOCAL,eth_type=0x800,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IP_SRC[]->NXM_OF_IP_DST[],output:NXM_OF_IN_PORT[]),mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.679s, table=0, n_packets=0, n_bytes=0, priority=99,ip,dl_dst=82:77:18:5b:00:48 actions=mod_nw_src:10.1.0.0,move:NXM_OF_ETH_SRC[0..15]->NXM_OF_IP_SRC[0..15],LOCAL
cookie=0x0, duration=2235.699s, table=0, n_packets=582, n_bytes=55852, priority=1,in_port=LOCAL actions=resubmit(,1)
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
cookie=0x0, duration=928.360s, table=1, n_packets=2, n_bytes=84, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_ARP_TPA[],output:tap2
cookie=0x0, duration=928.359s, table=1, n_packets=238, n_bytes=21969, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:03 actions=load:0xa00020a->NXM_OF_IP_DST[],output:tap2
cookie=0x0, duration=795.939s, table=1, n_packets=3, n_bytes=126, priority=100,arp,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0
cookie=0x0, duration=795.939s, table=1, n_packets=39, n_bytes=3177, priority=100,ip,in_port=LOCAL,dl_dst=42:41:40:3f:3e:3d actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=294, n_bytes=30244, priority=100,ip,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_IP_DST[],output:tap0
cookie=0x0, duration=924.028s, table=1, n_packets=6, n_bytes=252, priority=100,arp,in_port=LOCAL,dl_dst=52:54:00:12:00:01 actions=load:0xa00020b->NXM_OF_ARP_TPA[],output:tap0
The n_packets field increases with each ping attempt in this line:
cookie=0x0, duration=2235.706s, table=0, n_packets=855, n_bytes=59110, priority=0 actions=NORMAL
Updated by jlausuch over 5 years ago
According to the documentation [1]:
- packets from tapX to br1 create additional rules in table=1
- packets from br1 to tapX increase packet counts in table=1
So, we see the packet count increasing in table 0 but not in table 1.
[1] http://open.qa/docs/#_debugging_open_vswitch_configuration
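To compare the counters per table without reading every rule, the dump can be summarized. This is only a sketch: the helper name packets_per_table is made up, and it expects ovs-ofctl dump-flows output on stdin in the format shown in the dumps above.

```shell
# Sum n_packets per flow table from "ovs-ofctl dump-flows" output on stdin.
packets_per_table() {
    awk '
        # The leftmost "table=N" is the table of the rule itself; a later
        # "learn(table=1,...)" inside the actions is ignored by match().
        match($0, /table=[0-9]+/) {
            t = substr($0, RSTART + 6, RLENGTH - 6)
            if (match($0, /n_packets=[0-9]+/))
                sum[t] += substr($0, RSTART + 10, RLENGTH - 10)
        }
        END { for (t in sum) printf "table=%s n_packets=%d\n", t, sum[t] }
    ' | sort
}
```

Piping ovs-ofctl dump-flows br1 into packets_per_table once before and once after a few ping attempts, and comparing the two results, shows which table the traffic actually hits.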
Updated by agraul over 5 years ago
- Subject changed from test fails in t11_vlan_ifdown_modify_one_config to [kernel] test fails in t11_vlan_ifdown_modify_one_config
Updated by cfconrad over 5 years ago
Updated by jlausuch over 5 years ago
- File Multimachine in OpenQA.png added
- File packet headers.png added
After some investigation: os-autoinst handles multi-machine scenarios (2 or more parallel jobs) by assigning an 802.1Q VLAN tag to the packets leaving the virtual machine. This tag is set by the os-autoinst-openvswitch script [1]. That way, openQA can run different jobs at the same time even when they use the same IPs; the tag isolates the network of each job from the others.
Besides that, if the 2 VMs run on different workers (machines), a GRE tunnel encapsulates the traffic from one worker to the other.
So, the network flow would be something like in the attached picture Multimachine in OpenQA.png.
The problem arises when the packet already carries a VLAN tag when it leaves the VM. The TAP port receives this packet and, since it arrives tagged on a port that has its own tag assigned, drops it. So communication does not work in this case with the current setup.
The solution is VLAN stacking (802.1ad, QinQ), which inserts another VLAN header without changing the original packet, so the original tag stays part of the packet. It can even stack multiple VLAN tags in the same packet. The attached picture packet headers.png shows what we need to achieve.
The way to achieve this in Open vSwitch is to set the port option vlan_mode=dot1q-tunnel.
[1] https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch#L149
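As a sketch of what that setting looks like on the command line, assuming a TAP port that is already attached to the bridge. The port name tap0 and tag 5 used in the usage note are hypothetical, the helper name is made up, and this is not necessarily how the change was implemented in os-autoinst.

```shell
# Put a port into dot1q-tunnel mode so that frames already tagged by the VM
# keep their tag and get the port's VLAN added as an outer tag.
# OVS_VSCTL can be overridden (e.g. OVS_VSCTL=echo) for a dry run.
set_dot1q_tunnel() {
    port="$1"
    tag="$2"
    ${OVS_VSCTL:-ovs-vsctl} set port "$port" vlan_mode=dot1q-tunnel tag="$tag"
}
```

Running OVS_VSCTL=echo set_dot1q_tunnel tap0 5 prints the arguments that would be passed to ovs-vsctl without touching the switch, which is handy for checking the values before applying them on a worker.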
Updated by jlausuch over 4 years ago
- Status changed from New to Resolved
This was already resolved; I forgot to close the ticket at the time.