action #19806: eth0 address of one node is sometime in use by other test in multimachine tests within same ovs VLAN - openQA Project (public) - openSUSE Project Management Tool

action #19806

closed

eth0 address of one node is sometime in use by other test in multimachine tests within same ovs VLAN

Added by thehejik almost 8 years ago. Updated about 7 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

Feature requests

Target version:

Done

Start date:

2017-06-13

Due date:

% Done:

Estimated time:

Description

https://openqa.suse.de/tests/998491#step/setup/23

In the screenshot above you can see that the ifcfg-eth0 file is correct and has all the needed entries, but eth0 didn't get an IP address after calling rcnetwork restart. Maybe we can check in a loop and restart the network until we have an IP address.

Maybe we can also replace rcnetwork restart with "wicked ifdown eth0 && wicked ifup eth0" or with some systemctl call.

The function responsible for that is mm_network::configure_static_ip.

Maybe it is just an SP2 product bug.
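
The retry idea from the description could look roughly like the sketch below. It is only an illustration: the helper names has_ipv4 and wait_for_ipv4, the attempt count and the delay are assumptions and not part of mm_network. Retrying can only help with transient failures; it will not help if the address is rejected for a persistent reason.

```shell
#!/bin/sh
# Sketch only: retry bringing an interface up until it has an IPv4 address.
# has_ipv4 and wait_for_ipv4 are hypothetical names, not part of mm_network.

# Succeeds if the given interface currently has an IPv4 address.
has_ipv4() {
    ip -4 -o addr show dev "$1" 2>/dev/null | grep -q 'inet '
}

# usage: wait_for_ipv4 IFACE [ATTEMPTS] [DELAY_SECONDS]
wait_for_ipv4() {
    iface=$1
    attempts=${2:-5}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if has_ipv4 "$iface"; then
            return 0
        fi
        # Same commands as proposed in the description; keep retrying
        # even if wicked itself reports a failure.
        wicked ifdown "$iface" && wicked ifup "$iface" || true
        sleep "$delay"
        i=$((i + 1))
    done
    # Final check after the last restart.
    has_ipv4 "$iface"
}
```

A call such as `wait_for_ipv4 eth0 5 5` would return non-zero if eth0 still has no address after five restarts, so the caller can fail the test with a clear message.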

Related issues 1 (0 open — 1 closed)

Updated by thehejik almost 8 years ago

/var/lib/openqa/share/tests/opensuse/lib/mm_network.pm

Updated by thehejik almost 8 years ago

Basically this is the problem. Most probably it collides with another control node running at the same time with the same IP, but those VMs should be separated from each other by VLAN tags in ovs.

Jun 14 19:43:46 linux-87vr wicked[1187]: eth0 device-ready
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:01:3b
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: address '10.0.2.1' is already in use
Jun 14 19:43:46 linux-87vr wickedd[570]: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.0.2.2 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: error updating interface config from ipv4:static lease
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: device eth0: call to org.opensuse.Network.Addrconf.ipv4.static.requestLease() failed: General failure
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: eth0: failed to bring up device, still continuing
Jun 14 19:43:56 linux-87vr wicked[1328]: lo up
Jun 14 19:43:56 linux-87vr wicked[1328]: eth0 setup-in-progress
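
To spot this kind of collision quickly, the conflict lines can be pulled out of the journal. The sketch below assumes the wickedd message format shown in the log above; find_ip_conflicts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: extract address conflicts reported by wickedd.
# Reads journal lines on stdin and prints "<iface> <ip> <mac>" once per
# distinct conflict. Assumes the message format seen in the log above.
find_ip_conflicts() {
    sed -n 's/.*wickedd\[[0-9]*]: \([^:]*\): IP address \([0-9.]*\) is in use by \([0-9a-fA-F:]*\).*/\1 \2 \3/p' | sort -u
}
```

On the affected VM this could be used as `journalctl -b | find_ip_conflicts`; the reported MAC can then be matched against the MACs of the other running jobs.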

Updated by thehejik almost 8 years ago

I got the same result (eth0 is up but without an IP) by running two support_server VMs within one ovs bridge. Both VMs are preconfigured to use the same IP, 10.0.2.1, and use the host's tap adapters in a VLAN marked by tag=99 in ovs-vsctl show. The same happens if no VLAN tag is set. It works normally when the VMs and their tap devices are in different VLANs.

Steps to reproduce:
1) set up an ovs bridge for mm needs - br1 with tap devices according to https://github.com/os-autoinst/openQA/blob/master/docs/Networking.asciidoc

2) manually set the following openflow rule from /usr/lib/os-autoinst/os-autoinst-openvswitch

ovs-ofctl add-flow br1 table=0,priority=0,action=normal

(you can delete all openflow rules by ovs-ofctl del-flows br1)

3) create two overlay hdd images for two support server VMs:

qemu-img create support_server -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2

qemu-img create support_server2 -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2

4) run two VMs using the support_server images (use different tap devices and MACs, and the hdd images generated before):

qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap0,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:17 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic

and

qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap2,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:f9 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server2,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic

Note: the support_server images already have an /etc/sysconfig/network/ifcfg-eth0 file with the 10.0.2.1 address in it. But for some reason I'm getting only eth1. After removing /etc/udev/rules.d/70-persistent-net.rules and restarting, eth0 is back and the static address should be set automatically.

5) check the ip a output: the eth0 interfaces on both VMs are UP but only one VM has an IP. The faulty VM has the following in its journalctl log:

Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9 <--- MAC of the second VM
Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:01 linux-87vr wicked[1518]: eth0 device-ready
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:32 linux-87vr wicked[1659]: lo up
Jun 16 07:57:32 linux-87vr wicked[1659]: eth0 up

6) the address will be assigned once the tap devices in use are marked with different VLAN tag values; rcnetwork restart on the VMs is needed after that.

ovs-vsctl set port tap2 tag=12
(remove the tag by ovs-vsctl remove port tap2 tag 12)

So I probably know the reason why it fails but not the culprit. I need to know in which phase of VM provisioning the tags (~ VLANs) are configured for the host tap devices. Maybe there is some delay or similar.

Updated by thehejik almost 8 years ago

Subject changed from restart network until ip a s eth0 show an address in multimachine test to eth0 address of one node is sometime in use by other test in multimachine tests within same ovs VLAN

Updated by thehejik almost 8 years ago

Status changed from New to In Progress

Updated by thehejik almost 8 years ago

Ettore modified the handling of the dbus service for openvswitch and added better error reporting: https://github.com/os-autoinst/os-autoinst/pull/822

New packages are installed on openqaworker3. Let's see the output of failed mm tests later.

Updated by coolo almost 8 years ago

Does https://openqa.suse.de/tests/1009589#step/setup/25 contain the info you need?

Updated by coolo almost 8 years ago

Assignee set to thehejik
Priority changed from Normal to High

Updated by coolo almost 8 years ago

Has duplicate action #19276: Slenkins nodes often unable to reach network added

#10

Updated by thehejik almost 8 years ago

coolo wrote:

Does https://openqa.suse.de/tests/1009589#step/setup/25 contain the info you need?

Yes, it does. But it will still need more debugging.

This postgresql test consists of 3 jobs in total ~ 3 tap devices.

According to NICVLAN=23 from vars.json the test should use the VLAN marked by "tag: 23". That's correct: tap7 is really marked by tag 23 in the ovs-vsctl show output.

The problem is that we should have only 3 tap devices marked by "tag: 23" in the ovs-vsctl show output, but we have 8 tap devices marked by the same tag.

So most probably there are other tests running at the same time within the same VLAN.
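
The count of tap devices per tag can be checked mechanically. The sketch below works on "name,tag" CSV lines read from stdin; ports_in_vlan and count_ports_in_vlan are hypothetical helper names, and the assumption is that the input comes from something like `ovs-vsctl --format=csv --data=bare --no-headings --columns=name,tag list port`.

```shell
#!/bin/sh
# Sketch only: list or count the ovs ports that carry a given VLAN tag.
# Input on stdin: one "name,tag" CSV line per port (tag may be empty).

# usage: ports_in_vlan TAG   -> prints the names of ports with that tag
ports_in_vlan() {
    # Drop optional CSV quoting, then keep rows whose second field is TAG.
    tr -d '"' | awk -F, -v tag="$1" '$2 == tag { print $1 }'
}

# usage: count_ports_in_vlan TAG   -> prints the number of such ports
count_ports_in_vlan() {
    ports_in_vlan "$1" | wc -l | tr -d ' '
}
```

For the job above, a result larger than 3 from `count_ports_in_vlan 23` would indicate that other tests share the VLAN.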

#11

Updated by EDiGiacinto almost 8 years ago

AFAICS it might also be a backend issue. It looks like a race condition is possible during VLAN tag allocation, since the call that inserts a new VLAN isn't wrapped in a database transaction: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L1244.
It is hard to test/debug since the issue seems to happen sporadically and is difficult to reproduce locally.

Possible (yet untested) solution: https://github.com/mudler/openQA/commit/e2b9b7a5b6cfdaf2a8c1e87702c08d227e7abd64.patch

Will try to debug/test on local machine and touch base with thehejik once he comes back from vacation.
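
To illustrate the kind of race suspected here: if "find a free tag" and "claim it" are two separate steps, two jobs can pick the same tag. The sketch below is only an analogy and not the openQA code (which is Perl against a database). It uses mkdir, which either creates the directory or fails, so that claiming a tag is a single atomic step. The function names and the directory layout are made up for illustration.

```shell
#!/bin/sh
# Analogy only, not openQA's implementation: claim VLAN tags atomically.
# A tag is "allocated" when a directory named after it exists under DIR.

# usage: allocate_vlan_tag DIR [MAX]   -> prints the claimed tag
allocate_vlan_tag() {
    dir=$1
    max=${2:-4094}
    tag=1
    while [ "$tag" -le "$max" ]; do
        # mkdir fails if another caller already claimed this tag, so
        # check and claim cannot be interleaved by a second job.
        if mkdir "$dir/$tag" 2>/dev/null; then
            echo "$tag"
            return 0
        fi
        tag=$((tag + 1))
    done
    return 1
}

# usage: release_vlan_tag DIR TAG   -> frees the tag for reuse
release_vlan_tag() {
    rmdir "$1/$2"
}
```

In a database the equivalent guarantee would come from doing the lookup and the insert inside one transaction, or from a uniqueness constraint on the tag.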

#12

Updated by thehejik over 7 years ago

Target version set to Milestone 9

#13

Updated by EDiGiacinto over 7 years ago

The patch is included in the (WIP) PR: https://github.com/os-autoinst/openQA/pull/1389 - which covers another possible race condition.

#14

Updated by EDiGiacinto over 7 years ago

PR1389 was merged. I opened another one to free VLAN tag allocations after use: https://github.com/os-autoinst/os-autoinst/pull/828

#15

Updated by coolo over 7 years ago

Assignee deleted (~~thehejik~~)
Target version changed from Milestone 9 to Ready

did we see this lately?

#16

Updated by EDiGiacinto over 7 years ago

Status changed from In Progress to Resolved

coolo wrote:

did we see this lately?

Not anymore AFAICS. Marking as resolved as agreed on IRC (so reopen with new logs if necessary). We are having another related (sporadic) issue: after a job has been killed, tags are not wiped out.

#17

Updated by szarate about 7 years ago

Target version changed from Ready to Done
