action #19806

eth0 address of one node is sometimes in use by another test in multimachine tests within the same ovs VLAN

Added by thehejik over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Start date: 13/06/2017
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

https://openqa.suse.de/tests/998491#step/setup/23

The screenshot above shows a correct ifcfg-eth0 file with all the needed entries, but eth0 didn't get an IP address after calling rcnetwork restart. Maybe we can check in a loop and restart the network until we have an IP address.

Maybe we can also replace rcnetwork restart with "wicked ifdown eth0 && wicked ifup eth0" or with some systemctl call.
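A minimal sketch of the check-in-a-loop idea, assuming a shell on the node rather than the Perl test code. The function names has_ipv4, restart_iface and restart_until_ip are hypothetical and not part of mm_network; the retry count and delay are arbitrary defaults.

```shell
# Sketch only: helper names are hypothetical, not taken from mm_network.

# Succeed if the interface currently has at least one IPv4 address.
has_ipv4() {
    ip -4 -o addr show dev "$1" 2>/dev/null | grep -q 'inet '
}

# Bounce the interface with wicked instead of restarting the whole network.
restart_iface() {
    wicked ifdown "$1" && wicked ifup "$1"
}

# restart_until_ip IFACE [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 as soon as IFACE has an IPv4 address, non-zero if it never gets one.
restart_until_ip() {
    local iface=$1 max_tries=${2:-5} delay=${3:-5} try=1
    while [ "$try" -le "$max_tries" ]; do
        if has_ipv4 "$iface"; then
            return 0
        fi
        restart_iface "$iface"
        sleep "$delay"
        try=$((try + 1))
    done
    # Final check after the last restart.
    has_ipv4 "$iface"
}
```

It could be called as restart_until_ip eth0 5 10. A loop like this only helps when the failure is transient; it cannot fix an address that is rejected on every attempt.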

The function responsible for that is mm_network::configure_static_ip.

Maybe it is just an SP2 product bug.


Related issues

Duplicated by openQA Tests - action #19276: Slenkins nodes often unable to reach network Closed 20/05/2017

History

#1 Updated by thehejik over 2 years ago

/var/lib/openqa/share/tests/opensuse/lib/mm_network.pm

#2 Updated by thehejik over 2 years ago

Basically this is the problem. Most probably it collides with another control node running at the same time with the same IP, but those VMs should be separated from each other by VLAN tags in ovs.

Jun 14 19:43:46 linux-87vr wicked[1187]: eth0 device-ready
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:01:3b
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: address '10.0.2.1' is already in use
Jun 14 19:43:46 linux-87vr wickedd[570]: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.0.2.2 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: error updating interface config from ipv4:static lease
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: device eth0: call to org.opensuse.Network.Addrconf.ipv4.static.requestLease() failed: General failure
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: eth0: failed to bring up device, still continuing
Jun 14 19:43:56 linux-87vr wicked[1328]: lo up
Jun 14 19:43:56 linux-87vr wicked[1328]: eth0 setup-in-progress
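To spot such collisions quickly, the conflicting IP and MAC can be pulled out of the wickedd messages. This is a small sketch that assumes the message wording shown in the log above; the function name find_ip_conflicts is hypothetical.

```shell
# Read log lines on stdin and print "IP MAC" for every
# "IP address X is in use by Y" message from wickedd.
find_ip_conflicts() {
    sed -n 's/.*IP address \([0-9.]*\) is in use by \([0-9a-fA-F:]*\).*/\1 \2/p'
}
```

On a node it could be fed from the journal, for example journalctl | find_ip_conflicts | sort -u, and the MAC can then be compared with the MACs of the other running VMs.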

#3 Updated by thehejik over 2 years ago

I got the same result (eth0 is up but without an IP) by running 2 support_server VMs within one ovs bridge. Both VMs are preconfigured to use the same IP 10.0.2.1 and use the host's tap adaptors in a VLAN marked by tag=99 in ovs-vsctl show. The same happens if no VLAN tag is set. It works normally when the VMs and their tap devices are in different VLANs.

Steps to reproduce:
1) set up an ovs bridge for mm needs - br1 with tap devices according to https://github.com/os-autoinst/openQA/blob/master/docs/Networking.asciidoc

2) manually set the following openflow rule from /usr/lib/os-autoinst/os-autoinst-openvswitch

ovs-ofctl add-flow br1 table=0,priority=0,action=normal

(you can delete all openflow rules by ovs-ofctl del-flows br1)

3) create two overlay hdd images for two support server VMs:

qemu-img create support_server -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2

qemu-img create support_server2 -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2

4) run two VMs using the support_server images (use different tap devices, different MACs and the hdd images generated before):

qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap0,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:17 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic

and

qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap2,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:f9 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server2,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic

Note: the support_server images already have an /etc/sysconfig/network/ifcfg-eth0 file with the 10.0.2.1 address in it. But for some reason I'm getting only eth1. After removing /etc/udev/rules.d/70-persistent-net.rules and restarting, eth0 is back and the static address should be set automatically.

5) check the ip a output: the eth0 interfaces on both VMs are UP but only one VM has an IP. The faulty VM has the following in its journalctl log:

Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9 <--- MAC of the second VM
Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:01 linux-87vr wicked[1518]: eth0 device-ready
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:32 linux-87vr wicked[1659]: lo up
Jun 16 07:57:32 linux-87vr wicked[1659]: eth0 up

6) the address is assigned once the tap devices in use are marked with VLAN tags of different values; rcnetwork restart on the VMs is needed after that.

ovs-vsctl set port tap2 tag=12
(remove the tag by ovs-vsctl remove port tap2 tag 12)

So I probably know why it fails but not the culprit. I need to know in which phase of VM provisioning the tags (~ VLANs) are configured for the host tap devices. Maybe there is some delay or whatever.

#4 Updated by thehejik over 2 years ago

  • Subject changed from restart network until ip a s eth0 show an address in multimachine test to eth0 address of one node is sometimes in use by another test in multimachine tests within the same ovs VLAN

#5 Updated by thehejik over 2 years ago

  • Status changed from New to In Progress

#6 Updated by thehejik over 2 years ago

Ettore modified the handling of the dbus service for openvswitch and added better error reporting: https://github.com/os-autoinst/os-autoinst/pull/822

New packages are installed on openqaworker3. Let's see the output of failed mm tests later.

#7 Updated by coolo over 2 years ago

#8 Updated by coolo over 2 years ago

  • Assignee set to thehejik
  • Priority changed from Normal to High

#9 Updated by coolo over 2 years ago

  • Duplicated by action #19276: Slenkins nodes often unable to reach network added

#10 Updated by thehejik over 2 years ago

coolo wrote:

Does https://openqa.suse.de/tests/1009589#step/setup/25 contain the info you need?

Yes, it does. But it will still need more debugging.

This postgresql test consists of 3 jobs in total ~ 3 tap devices.

According to NICVLAN=23 from vars.json the test should use the VLAN marked by "tag: 23". That's correct - tap7 is indeed marked by tag 23 in the ovs-vsctl show output.

The problem is that only 3 tap devices should be marked by "tag: 23" in the ovs-vsctl show output, but we have 8 tap devices marked by that tag.

So most probably there are other tests running at the same time within the same VLAN.
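Counting tap devices per tag by eye in ovs-vsctl show gets tedious, so here is a rough sketch that groups ports by VLAN tag. The function name ports_per_vlan is hypothetical, and the ovs-vsctl output options in the usage line are taken from its man page and should be double-checked on the worker.

```shell
# Read "port,tag" CSV lines on stdin and print, per tag,
# how many ports carry it and which ones. Untagged ports are skipped.
ports_per_vlan() {
    awk -F, '
        $2 != "" && $2 != "[]" { count[$2]++; ports[$2] = ports[$2] " " $1 }
        END { for (tag in count) print tag ": " count[tag] " port(s):" ports[tag] }
    ' | sort -n
}
```

Assumed usage on the worker host: ovs-vsctl --format=csv --no-headings --data=bare --columns=name,tag list port | ports_per_vlan. A tag that lists more ports than the test has jobs points to another test sharing the VLAN.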

#11 Updated by EDiGiacinto over 2 years ago

AFAICS it might also be a backend issue: it looks like a race condition is possible during vlan tag allocation, since the call that inserts a new vlan isn't wrapped in a database transaction: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L1244.
It is hard to test/debug since the issue seems to happen sporadically and is difficult to reproduce locally.

Possible (yet untested) solution: https://github.com/mudler/openQA/commit/e2b9b7a5b6cfdaf2a8c1e87702c08d227e7abd64.patch

Will try to debug/test on a local machine and touch base with thehejik once he comes back from vacation.
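To illustrate the kind of race meant here: if "find a free tag" and "record it as taken" are two separate steps, two jobs can pick the same tag. The sketch below is only an analogy in shell, not how openQA allocates tags (that happens in the database). It uses the atomicity of mkdir on a local filesystem so that checking and claiming are a single step; allocate_tag, release_tag and the directory layout are hypothetical.

```shell
# allocate_tag DIR MAX: claim the lowest free tag in 1..MAX and print it.
# Returns non-zero if every tag is taken.
allocate_tag() {
    local dir=$1 max=$2 tag=1
    while [ "$tag" -le "$max" ]; do
        # mkdir fails if the directory already exists, so the existence
        # check and the claim cannot be interleaved by another process.
        if mkdir "$dir/$tag" 2>/dev/null; then
            echo "$tag"
            return 0
        fi
        tag=$((tag + 1))
    done
    return 1
}

# release_tag DIR TAG: free a tag so it can be handed out again.
release_tag() {
    rmdir "$1/$2"
}
```

Two concurrent callers of allocate_tag never receive the same tag, which is the property a transaction around the insert is meant to give the real allocation.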

#12 Updated by thehejik over 2 years ago

  • Target version set to Milestone 9

#13 Updated by EDiGiacinto over 2 years ago

The patch is included in the (WIP) PR: https://github.com/os-autoinst/openQA/pull/1389 - which covers another possible race condition

#14 Updated by EDiGiacinto over 2 years ago

PR1389 was merged; opened another one to free vlan tag allocations after use: https://github.com/os-autoinst/os-autoinst/pull/828

#15 Updated by coolo over 2 years ago

  • Assignee deleted (thehejik)
  • Target version changed from Milestone 9 to Ready

did we see this lately?

#16 Updated by EDiGiacinto over 2 years ago

  • Status changed from In Progress to Resolved

coolo wrote:

did we see this lately?

Not anymore AFAICS, marking as resolved as agreed on IRC (so reopen with new logs if necessary). We are having another related (sporadic) issue: after a job has been killed, its tags are not wiped out.

#17 Updated by szarate almost 2 years ago

  • Target version changed from Ready to Done
