action #19806
closed
eth0 address of one node is sometimes in use by another test in multimachine tests within the same ovs VLAN
Added by thehejik over 7 years ago. Updated almost 7 years ago.
0%
Description
https://openqa.suse.de/tests/998491#step/setup/23
In the screenshot above you can see that we have a correct ifcfg-eth0 file with all needed entries, but eth0 didn't get an IP after calling rcnetwork restart. Maybe we can check in a loop and restart the network until we have an IP.
Maybe we can also replace rcnetwork restart with "wicked ifdown eth0 && wicked ifup eth0" or with some systemctl call.
The function responsible for that is mm_network::configure_static_ip.
Maybe it is just an SP2 product bug.
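The retry idea suggested above could look like this; a minimal sketch, where wait_for and has_ip are hypothetical helper names and the wicked calls are the ticket's suggestion, not code taken from mm_network::configure_static_ip:

```shell
#!/bin/bash
# Minimal retry sketch. wait_for and has_ip are hypothetical helpers,
# not functions from mm_network.pm.

# Retry a probe command up to N times, sleeping between attempts.
wait_for() {
  local tries="$1"; shift
  local i
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0   # probe succeeded
    sleep 1
  done
  return 1             # gave up
}

# True when the interface reports an IPv4 address.
has_ip() {
  ip -4 addr show "$1" 2>/dev/null | grep -q 'inet '
}

# Usage sketch on the SUT (assumed commands, per the ticket's suggestion):
#   wicked ifdown eth0 && wicked ifup eth0
#   wait_for 10 has_ip eth0 || echo "eth0 still has no IP" >&2
```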
Updated by thehejik over 7 years ago
/var/lib/openqa/share/tests/opensuse/lib/mm_network.pm
Updated by thehejik over 7 years ago
Basically this is the problem. Most probably it collides with another control node running at the same time with the same IP, but those VMs should be separated from each other by vlan tags in ovs.
Jun 14 19:43:46 linux-87vr wicked[1187]: eth0 device-ready
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:01:3b
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: address '10.0.2.1' is already in use
Jun 14 19:43:46 linux-87vr wickedd[570]: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.0.2.2 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]
Jun 14 19:43:46 linux-87vr wickedd[570]: eth0: error updating interface config from ipv4:static lease
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: device eth0: call to org.opensuse.Network.Addrconf.ipv4.static.requestLease() failed: General failure
Jun 14 19:43:46 linux-87vr wickedd-nanny[576]: eth0: failed to bring up device, still continuing
Jun 14 19:43:56 linux-87vr wicked[1328]: lo up
Jun 14 19:43:56 linux-87vr wicked[1328]: eth0 setup-in-progress
Updated by thehejik over 7 years ago
I got the same results (eth0 is up but without an IP) by running 2 support_server VMs within one ovs bridge. Both VMs are preconfigured to use the same IP 10.0.2.1 and use the host's tap adapters in a VLAN marked by tag=99 in ovs-vsctl show output; the same happens if no VLAN tag is set at all. It works normally only when the VMs and their tap devices are in different VLANs.
Steps to reproduce:
1) setup an ovs bridge for mm needs - br1 with tap devices according to https://github.com/os-autoinst/openQA/blob/master/docs/Networking.asciidoc
2) manually set the following openflow rule from /usr/lib/os-autoinst/os-autoinst-openvswitch:
ovs-ofctl add-flow br1 table=0,priority=0,action=normal
(you can delete all openflow rules with ovs-ofctl del-flows br1)
3) create two overlay hdd images for two support server VMs:
qemu-img create support_server -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2
qemu-img create support_server2 -f qcow2 -b /var/lib/openqa/share/factory/hdd/openqa_support_server_sles12sp1.x86_64.qcow2
4) run two VMs using the support_server images (use different tap devices, the hdd images generated before, and different MACs):
qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap0,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:17 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic
and
qemu-system-x86_64 -vga cirrus -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap2,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:aa:f9 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=support_server2,cache=unsafe,if=none,id=hd1,format=qcow2 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown --nographic
Note: the support_server images already have an /etc/sysconfig/network/ifcfg-eth0 file present with the 10.0.2.1 address in it. But for some reason I'm getting only eth1 - after removing /etc/udev/rules.d/70-persistent-net.rules and restarting, eth0 is back and the static address should be set automatically.
5) check ip a output; the eth0 interfaces on both VMs are UP but only one VM has an IP. The faulty VM has the following in the journalctl log:
Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9 <--- MAC of the second VM
Jun 16 07:56:15 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:01 linux-87vr wicked[1518]: eth0 device-ready
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:aa:f9
Jun 16 07:57:02 linux-87vr wickedd[589]: eth0: address '10.0.2.1' is already in use
Jun 16 07:57:32 linux-87vr wicked[1659]: lo up
Jun 16 07:57:32 linux-87vr wicked[1659]: eth0 up
6) the address will be assigned once the tap devices in use are marked with VLAN tags of different values; rcnetwork restart on the VMs is needed after that.
ovs-vsctl set port tap2 tag=12
(remove the tag with ovs-vsctl remove port tap2 tag 12)
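To confirm whether two taps ended up sharing a VLAN, the port/tag pairs can be checked for duplicates. A hedged sketch - find_shared_tags is a hypothetical helper, and the "port tag" input format is an assumption (on a real host you would extract the pairs from ovs-vsctl output):

```shell
#!/bin/bash
# Hedged sketch: read "port tag" pairs from stdin and report any tag
# shared by more than one port. find_shared_tags is a hypothetical name.
find_shared_tags() {
  awk '{ count[$2]++; ports[$2] = ports[$2] " " $1 }
       END { for (t in count) if (count[t] > 1) print "tag " t ":" ports[t] }'
}

# Example with fabricated data: tap0 and tap2 share tag 99.
printf 'tap0 99\ntap2 99\ntap4 12\n' | find_shared_tags   # prints: tag 99: tap0 tap2
```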
So I probably know why it fails but don't know the culprit - I need to know in which phase of VM provisioning the tags are configured for the host's tap devices (~ VLANs). Maybe there is some delay or whatever.
Updated by thehejik over 7 years ago
- Subject changed from restart network until ip a s eth0 show an address in multimachine test to eth0 address of one node is sometime in use by other test in multimachine tests within same ovs VLAN
Updated by thehejik over 7 years ago
Ettore modified handling of dbus service for openvswitch and added better error reporting https://github.com/os-autoinst/os-autoinst/pull/822
New packages are installed on openqaworker3. Let's see the output of failed mm tests later.
Updated by coolo over 7 years ago
Does https://openqa.suse.de/tests/1009589#step/setup/25 contain the info you need?
Updated by coolo over 7 years ago
- Assignee set to thehejik
- Priority changed from Normal to High
Updated by coolo over 7 years ago
- Has duplicate action #19276: Slenkins nodes often unable to reach network added
Updated by thehejik over 7 years ago
coolo wrote:
Does https://openqa.suse.de/tests/1009589#step/setup/25 contain the info you need?
Yes, it does. But it will still need more debugging.
This postgresql test consists of 3 jobs in total ~ 3 tap devices.
According to NICVLAN=23 from vars.json the test should use the VLAN marked by "tag: 23"; that's correct - tap7 is really marked by tag 23 in ovs-vsctl show output.
The problem is that we should have only 3 tap devices marked by "tag: 23" in ovs-vsctl show output, but we have 8 tap devices marked by the same tag.
So most probably there are other tests running at the same time within the same VLAN.
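Counting how many taps carry a given tag can be scripted against ovs-vsctl show output. A hedged sketch - count_taps_with_tag is a hypothetical helper, and the sample input is fabricated to resemble typical ovs-vsctl show formatting:

```shell
#!/bin/bash
# Hedged sketch: count tap ports carrying a given VLAN tag in
# `ovs-vsctl show`-style output read from stdin.
count_taps_with_tag() {
  awk -v want="$1" '
    $1 == "Port"                               { port = $2 }
    $1 == "tag:" && $2 == want && port ~ /tap/ { n++ }
    END                                        { print n + 0 }
  '
}

# Fabricated sample resembling ovs-vsctl show output:
sample='    Port "tap7"
        tag: 23
    Port "tap3"
        tag: 12
    Port "tap9"
        tag: 23'
printf '%s\n' "$sample" | count_taps_with_tag 23   # prints 2
```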
Updated by EDiGiacinto over 7 years ago
AFAICS it might also be a backend issue; it looks like it's possible to have a race condition during VLAN tag allocation, since the call that inserts a new vlan isn't wrapped in a database transaction: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L1244.
It is hard to test/debug since the issue seems to happen sporadically - and it's difficult to reproduce locally.
Possible (yet untested) solution: https://github.com/mudler/openQA/commit/e2b9b7a5b6cfdaf2a8c1e87702c08d227e7abd64.patch
Will try to debug/test on local machine and touch base with thehejik once he comes back from vacation.
Updated by EDiGiacinto over 7 years ago
The patch is included in the (WIP) PR https://github.com/os-autoinst/openQA/pull/1389, which covers another possible race condition.
Updated by EDiGiacinto over 7 years ago
PR 1389 was merged; opened another one to free VLAN tags after use: https://github.com/os-autoinst/os-autoinst/pull/828
Updated by coolo about 7 years ago
- Assignee deleted (thehejik)
- Target version changed from Milestone 9 to Ready
did we see this lately?
Updated by EDiGiacinto about 7 years ago
- Status changed from In Progress to Resolved
coolo wrote:
did we see this lately?
Not anymore AFAICS; marking as resolved as agreed on IRC (reopen with new logs if necessary). We are having another related (sporadic) issue: after a job has been killed, the tags are not wiped out.
Updated by szarate almost 7 years ago
- Target version changed from Ready to Done