action #54074
closednetwork tap / bridge setup is broken on some workers (openqaworker7)
Description
On some workers I see this error message in the os-autoinst log:
Failed to run dbus command 'set_vlan' with arguments 'tap13 2' : 'tap13' is not connected to bridge 'br1'
During a test run this causes a broken network (ping does not work, no DNS name resolution).
Updated by asmorodskyi almost 5 years ago
Another failed job: https://openqa.suse.de/tests/3056848
Updated by acarvajal almost 5 years ago
Also seeing many MM network problems with openqaworker7 in the HA group for a week or so.
Initially I thought the problem was also present on openqaworker5, but I haven't seen jobs fail on openqaworker5 due to network issues since the openqa.suse.de upgrade of July 9th/10th.
Failed tests sample:
1) https://openqa.suse.de/tests/3055962#step/setup/25 ([debug] Failed to run dbus command 'set_vlan' with arguments 'tap19 7' : 'tap19' is not connected to bridge 'br1')
2) https://openqa.suse.de/tests/3071752#step/setup/25 ([debug] Failed to run dbus command 'set_vlan' with arguments 'tap15 3' : 'tap15' is not connected to bridge 'br1')
3) https://openqa.suse.de/tests/3054586#step/upgrade_from_sle11sp4_workarounds/16 ([debug] Failed to run dbus command 'set_vlan' with arguments 'tap14 4' : 'tap14' is not connected to bridge 'br1')
4) https://openqa.suse.de/tests/3071747#step/upgrade_from_sle11sp4_workarounds/16 ([debug] Failed to run dbus command 'set_vlan' with arguments 'tap11 6' : 'tap11' is not connected to bridge 'br1')
5) https://openqa.suse.de/tests/3071758#step/upgrade_from_sle11sp4_workarounds/16 ([debug] Failed to run dbus command 'set_vlan' with arguments 'tap11 3' : 'tap11' is not connected to bridge 'br1')
The last job ran in parallel with https://openqa.suse.de/tests/3071756, which also ran on openqaworker7 and doesn't show the 'set_vlan' error in its autoinst.log: https://openqa.suse.de/tests/3071756/file/autoinst-log.txt
It seems that the second job used tap3, which looks different from tap11 when checked with 'ip a' on openqaworker7 (tap3 lists 'master ovs-system', tap11 does not):
41: tap3: mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
link/ether e2:b9:5f:d0:95:1d brd ff:ff:ff:ff:ff:ff
vs.
11: tap11: mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
link/ether c2:5b:f9:9c:f4:2b brd ff:ff:ff:ff:ff:ff
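The visible difference between the two entries is the `master ovs-system` field, which only tap3 has. A minimal sketch for checking this per interface, assuming the one-line output format of `ip -o link show` (the helper name `link_master` is made up for illustration):

```shell
# Hypothetical helper: read `ip -o link show <dev>` output on stdin and print
# the device named after "master", or print nothing if there is no such field.
link_master() {
    sed -n 's/.* master \([^ ]*\) .*/\1/p'
}
```

For example, `ip -o link show tap11 | link_master` would print nothing for an interface in the state shown above, while the same call for tap3 would print `ovs-system`.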
There's a startup file for both interfaces:
openqaworker7:~ # ls -l /etc/sysconfig/network/ifcfg-tap{3,11}
-rw------- 1 root root 141 Jul 8 15:55 /etc/sysconfig/network/ifcfg-tap11
-rw------- 1 root root 141 Jul 8 15:55 /etc/sysconfig/network/ifcfg-tap3
But tap11 is not present in /etc/sysconfig/network/ifcfg-br1:
openqaworker7:~ # egrep 'tap3|tap11' /etc/sysconfig/network/ifcfg-br1
OVS_BRIDGE_PORT_DEVICE_3='tap3'
In short, it seems that the bridge configuration is missing for 30 out of 60 tap interfaces:
openqaworker7:~ # grep OVS_BRIDGE_PORT_DEVICE_ /etc/sysconfig/network/ifcfg-br1 |wc -l
30
openqaworker7:~ # ls /etc/sysconfig/network/ifcfg-tap*|wc -l
60
Checking other workers, this is not the case:
openqaworker6:~ # grep OVS_BRIDGE_PORT_DEVICE_ /etc/sysconfig/network/ifcfg-br1 |wc -l
60
openqaworker6:~ # ls /etc/sysconfig/network/ifcfg-tap*|wc -l
60
openqaworker5:~ # grep OVS_BRIDGE_PORT_DEVICE_ /etc/sysconfig/network/ifcfg-br1 |wc -l
66
openqaworker5:~ # ls /etc/sysconfig/network/ifcfg-tap*|wc -l
66
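The counts above only show how many entries are missing, not which ones. A rough sketch that lists the tap interfaces with an ifcfg file but no matching entry in the bridge config could look like this. It assumes the `OVS_BRIDGE_PORT_DEVICE_<n>='tapN'` line format seen in the grep output above; the function name and its arguments are made up for illustration.

```shell
# Hypothetical helper: print every tapN that has an ifcfg-tapN file
# but no OVS_BRIDGE_PORT_DEVICE_ entry in the bridge's ifcfg file.
# $1: config directory (defaults to /etc/sysconfig/network)
# $2: bridge name (defaults to br1)
missing_bridge_ports() {
    netdir="${1:-/etc/sysconfig/network}"
    bridge="${2:-br1}"
    for f in "$netdir"/ifcfg-tap*; do
        # Skip the unexpanded pattern when no ifcfg-tap file exists.
        [ -e "$f" ] || continue
        tap="${f##*/ifcfg-}"
        # Match the quoted name exactly so that tap1 does not match tap11.
        if ! grep -q "^OVS_BRIDGE_PORT_DEVICE_[0-9]*='$tap'" "$netdir/ifcfg-$bridge" 2>/dev/null; then
            echo "$tap"
        fi
    done
}
```

Running `missing_bridge_ports` without arguments on openqaworker7 should then print one line per tap interface that is absent from ifcfg-br1, presumably including tap11.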
Updated by acarvajal almost 5 years ago
I'm seeing in the salt pillars that 'numofworkers' for openqaworker7 was reduced from 20 to 10 as a consequence of poo#49694.
Since there are 60 interfaces but only half are configured for the bridge, I think this is related.
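The numbers are consistent with that guess if each worker instance gets three tap devices. That ratio is an assumption inferred from 60 ifcfg-tap files for the former 20 workers; it is not confirmed in the pillar data quoted here. A small sketch of the arithmetic, with a made-up helper name:

```shell
# Hypothetical helper: expected number of bridge ports for a worker count.
# The default of 3 taps per worker is an assumption (60 tap files / 20 workers).
expected_bridge_ports() {
    numofworkers="$1"
    taps_per_worker="${2:-3}"
    echo $(( numofworkers * taps_per_worker ))
}
```

Under that assumption, `expected_bridge_ports 20` gives the 60 ifcfg-tap files found on disk and `expected_bridge_ports 10` gives the 30 entries found in ifcfg-br1, which would fit a bridge config regenerated for 10 workers while the old tap files stayed in place.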
Updated by jlausuch over 4 years ago
Another occurrence with openqaworker7: https://openqa.suse.de/tests/3125427#
[2019-07-25T00:27:03.793 CEST] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap10 6' : 'tap10' is not connected to bridge 'br1'
[2019-07-25T00:27:03.834 CEST] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap74 6' : 'tap74' is not connected to bridge 'br1'
Updated by nicksinger over 4 years ago
- Is duplicate of action #54632: [tools] openqaworker7 is causing failures added