action #35895

[os-autoinst-openvswitch][theory] problem with assigning VLAN tags - more testsuites are using the same VLAN?

Added by thehejik almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Start date: 04/05/2018
Priority: Normal
Due date:
Assignee: EDiGiacinto
% Done: 0%
Category: -
Target version: Done
Difficulty:
Duration:

Description

Anton noticed an issue where support_server failed due to network problems: the support server (ss) obtained a lease for 10.0.2.22, most probably from a different ss in the same VLAN.

https://openqa.suse.de/tests/1667461 failed support_server oqw3 tap4 3.5. 20:33(+2 on worker)
https://openqa.suse.de/tests/1667530 parallel_failed master oqw7 tap12 3.5. 20:33
https://openqa.suse.de/tests/1667523 parallel_failed slave oqw6 tap0 3.5. 20:33

So I checked oqw{3,7,6} and found a potential issue: on oqw7, more than one tap device is assigned to VLAN tag=50.

Anton has added OVS_DEBUG=1 to his testsuites, so the next time this issue occurs we can debug it in autoinst-log.txt (and see whether any foreign taps are assigned to the same VLAN).

note to myself ... determine VLAN tag, see test start time => sudo journalctl -u os-autoinst-openvswitch | grep -e "tag.*50"

thehejik@openqaworker7:~> sudo journalctl -u os-autoinst-openvswitch | grep -e "tag.*50"
kvě 03 22:29:50 openqaworker7 ovs-vsctl[16033]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap12 tag=50
kvě 03 22:29:51 openqaworker7 ovs-vsctl[16058]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap13 tag=50
kvě 03 22:33:18 openqaworker7 ovs-vsctl[16899]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap12 tag=50
kvě 03 22:51:39 openqaworker7 ovs-vsctl[24378]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap9 tag=50
kvě 03 22:52:32 openqaworker7 ovs-vsctl[24896]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap9 tag 50
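
The manual grep above can be turned into a small replay of the journal. The following is only a sketch: the function name find_shared_tags is made up for illustration, and it assumes the journal contains exactly the two "Called as ovs-vsctl ..." line shapes shown in the excerpt. Note that a shared tag is not necessarily an error, since machines of the same multi-machine cluster are expected to share one; the output only lists candidates to check against the job's cluster.

```python
import re

# Line shapes taken from the journal excerpt; anything else is ignored.
SET_RE = re.compile(r"Called as ovs-vsctl set port (\S+) tag=(\d+)")
REMOVE_RE = re.compile(r"Called as ovs-vsctl remove port (\S+) tag (\d+)")


def find_shared_tags(journal_lines):
    """Replay set/remove events in order.

    Returns a list of (tag, ports) tuples, one for every moment a port
    newly joined a tag that another port was already holding.
    """
    port_tag = {}  # current tag per tap device
    overlaps = []
    for line in journal_lines:
        m = SET_RE.search(line)
        if m:
            port, tag = m.group(1), int(m.group(2))
            if port_tag.get(port) == tag:
                continue  # same tag set again, nothing new
            port_tag[port] = tag
            holders = sorted(p for p, t in port_tag.items() if t == tag)
            if len(holders) > 1:
                overlaps.append((tag, tuple(holders)))
            continue
        m = REMOVE_RE.search(line)
        if m:
            port, tag = m.group(1), int(m.group(2))
            if port_tag.get(port) == tag:
                del port_tag[port]
    return overlaps
```

It could be fed with the lines of `sudo journalctl -u os-autoinst-openvswitch`, for example by iterating over sys.stdin.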

A related problem may be that VLAN tags are not removed when a test fails.

History

#1 Updated by okurz over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_ganglia_supportserver
https://openqa.suse.de/tests/1728938

#2 Updated by mloviska over 1 year ago

A similar issue happened in the following run: https://openqa.suse.de/tests/1777426

output from serial0.txt:

Jun 21 15:36:40 susetest systemd[1]: Starting wicked managed network interfaces...
Jun 21 15:36:40 susetest wickedd[964]: eth0: IP address 10.0.2.1 is in use by 52:54:00:12:03:9e
Jun 21 15:36:40 susetest wickedd[964]: eth0: address '10.0.2.1' is already in use
Jun 21 15:36:40 susetest wickedd[964]: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.0.2.2 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]
Jun 21 15:37:10 susetest wicked[1718]: lo up
Jun 21 15:37:10 susetest wicked[1718]: eth0 device-not-running
Jun 21 15:37:10 susetest systemd[1]: Started wicked managed network interfaces.

Some tap devices are assigned VLAN tags that are still in use by other tap devices:

Jun 22 09:05:10 openqaworker6 ovs-vsctl[14369]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap6 tag=39
Jun 22 09:05:11 openqaworker6 ovs-vsctl[14373]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap1 tag=54
Jun 22 09:07:16 openqaworker6 ovs-vsctl[15211]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap4 tag=42
Jun 22 09:07:16 openqaworker6 ovs-vsctl[15214]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap68 tag=81
Jun 22 09:07:16 openqaworker6 ovs-vsctl[15217]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap132 tag=41
Jun 22 09:19:39 openqaworker6 ovs-vsctl[21624]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap0 tag=17
Jun 22 09:22:02 openqaworker6 ovs-vsctl[23163]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap0 tag 17
Jun 22 09:24:39 openqaworker6 ovs-vsctl[24487]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap0 tag=39
Jun 22 09:24:40 openqaworker6 ovs-vsctl[24517]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap1 tag=41
Jun 22 09:24:41 openqaworker6 ovs-vsctl[24555]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap16 tag=17
Jun 22 09:24:41 openqaworker6 ovs-vsctl[24561]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap2 tag=82

#3 Updated by EDiGiacinto over 1 year ago

  • Status changed from New to Feedback
  • Assignee set to EDiGiacinto
  • Target version set to Current Sprint

In https://github.com/os-autoinst/os-autoinst/pull/1006 we changed when the VLANs are untagged: it is now tied to the lifespan of the qemu process. I need feedback on it once it is deployed, since it will likely require further changes to cover edge cases.
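
To illustrate the idea of tying the tag to a process lifespan, here is a minimal sketch. It is not the code from the pull request: the helper name vlan_tag and the injectable run parameter are assumptions made for illustration, and only the two ovs-vsctl invocations are taken from the journal output in this ticket.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def vlan_tag(port, tag, run=subprocess.check_call):
    """Tag `port` on entry and always untag it on exit.

    `run` is a hypothetical injection point so the sketch can be
    exercised without Open vSwitch installed.
    """
    run(["ovs-vsctl", "set", "port", port, f"tag={tag}"])
    try:
        yield
    finally:
        # Runs on normal exit and on failure, so no stale tag is left.
        run(["ovs-vsctl", "remove", "port", port, "tag", str(tag)])
```

Wrapping the qemu invocation in `with vlan_tag("tap9", 50): ...` would then remove the tag even if the test dies with an exception, though not if the calling process itself is killed.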

#4 Updated by szarate over 1 year ago

I think we haven't seen this one for a while

#5 Updated by thehejik over 1 year ago

We still encounter this, see https://openqa.suse.de/tests/2029994/file/serial0.txt

Sep 06 08:41:56 susetest wickedd[987]: eth0: IPv4 duplicate address 10.0.2.1 detected (in use by 52:54:00:12:03:c7)!
Sep 06 08:41:56 susetest wickedd[987]: __ni_rtnl_send_newroute(ipv4 0.0.0.0/0 via 10.0.2.2 dev eth0 type unicast table main scope universe protocol boot): ni_nl_talk failed [Unspecific failure]

#6 Updated by asmorodskyi over 1 year ago

Another job with the same failure: https://openqa.suse.de/tests/2030015

#7 Updated by coolo over 1 year ago

  • Status changed from Feedback to Resolved

(meeting protocol :)
We fixed 2 things:
- fixed the vlan assignment on scheduler side to make sure they don't collide between multimachine clusters
- fixed os-autoinst to kill the vlan when the qemu process is done

We think this will reduce the impact of the problem drastically, but a general solution
to this race might still be needed. We first need to see how big the remaining problem is
before we decide when to put work time into it.
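
The first fix listed above (no collisions between multi-machine clusters) can be pictured with the following sketch. It is purely illustrative and not the scheduler's actual code: the class VlanAllocator and its methods are made up here, and it deliberately ignores the cross-worker race mentioned above. The only outside fact used is that usable 802.1Q VLAN IDs range from 1 to 4094.

```python
class VlanAllocator:
    """Hands out one VLAN tag per cluster and never reuses a tag in use."""

    def __init__(self, first=1, last=4094):
        self._range = range(first, last + 1)
        self._by_cluster = {}  # cluster id -> tag

    def acquire(self, cluster_id):
        # All jobs of one cluster share the tag the cluster already holds.
        if cluster_id in self._by_cluster:
            return self._by_cluster[cluster_id]
        used = set(self._by_cluster.values())
        for tag in self._range:
            if tag not in used:
                self._by_cluster[cluster_id] = tag
                return tag
        raise RuntimeError("no free VLAN tag")

    def release(self, cluster_id):
        # Called once the whole cluster is done; unknown ids are ignored.
        self._by_cluster.pop(cluster_id, None)
```

Jobs of the same cluster get the same tag, different clusters never do, and a tag becomes available again only after an explicit release.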

#8 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done
