action #14334

job incomplete: "could not configure /dev/net/tun (tap00): Device or resource busy"

Added by okurz over 3 years ago. Updated over 2 years ago.

Status: Resolved
Start date: 20/10/2016
Priority: Urgent
Due date:
Assignee: asmorodskyi
% Done: 0%
Category: Bugs in existing tests
Target version: openQA Project - Milestone 4
Difficulty:
Duration:

Description

observation

t#621407 is incomplete. https://openqa.suse.de/tests/621407/file/autoinst-log.txt shows the error details:

DIE Died at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
    backend::baseclass::die_handler('Died at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76.\x{a}') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76
    consoles::vnc_base::catch {...} ('Error connecting to host <localhost>\x{a}$VAR1 = bless( {\x{a}       ...') called at /usr/lib/perl5/vendor_perl/5.18.2/Try/Tiny.pm line 104
    Try::Tiny::try('CODE(0x61b1c78)', 'Try::Tiny::Catch=REF(0x627a2b8)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 80
    consoles::vnc_base::connect_vnc('consoles::vnc_base=HASH(0x61b9a20)', 'HASH(0x5f79320)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 37
    consoles::vnc_base::activate('consoles::vnc_base=HASH(0x61b9a20)') called at /usr/lib/os-autoinst/consoles/console.pm line 74
    consoles::console::select('consoles::vnc_base=HASH(0x61b9a20)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 469
    backend::baseclass::select_console('backend::qemu=HASH(0x60c04b0)', 'HASH(0x61b99a8)') called at /usr/lib/os-autoinst/backend/qemu.pm line 679
    backend::qemu::start_qemu('backend::qemu=HASH(0x60c04b0)') called at /usr/lib/os-autoinst/backend/qemu.pm line 98
    backend::qemu::do_start_vm('backend::qemu=HASH(0x60c04b0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 255
    backend::baseclass::start_vm('backend::qemu=HASH(0x60c04b0)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 68
    backend::baseclass::handle_command('backend::qemu=HASH(0x60c04b0)', 'HASH(0x6161500)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 427
    backend::baseclass::check_socket('backend::qemu=HASH(0x60c04b0)', 'IO::Handle=GLOB(0x608c5b8)') called at /usr/lib/os-autoinst/backend/qemu.pm line 893
    backend::qemu::check_socket('backend::qemu=HASH(0x60c04b0)', 'IO::Handle=GLOB(0x608c5b8)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 209
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 171
    backend::baseclass::run_capture_loop('backend::qemu=HASH(0x60c04b0)', 'IO::Select=ARRAY(0x60095e8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 120
    backend::baseclass::run('backend::qemu=HASH(0x60c04b0)', 6, 10) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x3fbc450)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'qemu') called at /usr/bin/isotovideo line 177
    main::init_backend() called at /usr/bin/isotovideo line 236
05:44:49.2515 13306 waitpid for 13314 returned 13314
05:44:49.2519 13306 QEMU: qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap00,script=/etc/qemu-ifup-br0,downscript=no: could not configure /dev/net/tun (tap00): Device or resource busy
05:44:49.2520 13306 QEMU: qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap00,script=/etc/qemu-ifup-br0,downscript=no: Device 'tap' could not be initialized

reproducible

https://openqa.suse.de/tests?hoursfresh=24&match=hacluster-supportserver shows that this happens often, but not every time.

problem

H1. problem with setting up the tun device
H1.1. tun device fails for high instance worker numbers (see #14334#note-7)
H2. conflict with other jobs accessing the same devices at the same time

workaround

A restart seems to help:

  1. Check whether another instance of the supportserver is running (and using the same tun device).
  2. If that instance is in "zombie mode" (its parallel jobs are incomplete, passed, failed, or restarted, so the supportserver is no longer needed by any job), cancel the zombie instance.
  3. Restart the parallel jobs (the supportserver will be retriggered).

Related issues

Related to openQA Project - action #19432: [multimachine][scheduling] Fail of one multi-machine jobs... Resolved 30/05/2017
Blocks openQA Tests - action #15416: [tools] bridge device seems to have disappeared for HA tests Resolved 16/09/2016

History

#1 Updated by dzedro over 3 years ago

There is no tap00: the device does not exist and is not listed in /etc/sysconfig/network/ifcfg-br1.
The question is how the worker got tap00.
I restarted worker :3

openqaworker3:~ # ovs-ofctl show br1
OFPT_FEATURES_REPLY (xid=0x2): dpid:00003ef9f690fd43
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: OUTPUT SET_VLAN_VID SET_VLAN_PCP STRIP_VLAN SET_DL_SRC SET_DL_DST SET_NW_SRC SET_NW_DST SET_NW_TOS SET_TP_SRC SET_TP_DST ENQUEUE
 1(tap132): addr:12:2c:6a:ef:1d:64
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 2(tap130): addr:b2:ee:80:5c:79:c2
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 3(tap64): addr:d2:06:4c:bf:72:6d
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 4(tap0): addr:f2:43:0e:5a:21:c0
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 5(tap4): addr:d6:38:c0:8d:27:ce
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 6(tap67): addr:7e:c7:39:24:a0:7c
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 7(tap1): addr:6a:73:2a:e4:fc:ec
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 8(tap65): addr:b6:00:58:79:0a:4f
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 9(tap131): addr:b2:77:86:ff:c9:85
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 10(tap66): addr:82:f6:ad:26:ad:d4
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 11(tap128): addr:4a:19:2e:f7:2e:73
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 12(tap133): addr:c6:2c:11:1c:d8:5c
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 13(tap5): addr:26:77:d5:0e:7e:af
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 14(tap68): addr:72:5c:0a:06:d1:a1
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 15(tap129): addr:76:cb:d9:59:b8:ba
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 16(tap69): addr:0a:ca:86:c6:6d:54
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 17(tap3): addr:6a:62:87:8e:75:3d
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 18(tap2): addr:56:1e:3f:3b:29:40
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 LOCAL(br1): addr:3e:f9:f6:90:fd:43
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
openqaworker3:~ #

openqaworker3:~ # cat /etc/sysconfig/network/ifcfg-br1 
BOOTPROTO='static'
IPADDR='10.0.2.2/15'
STARTMODE='auto'
OVS_BRIDGE='yes'
OVS_BRIDGE_PORT_DEVICE_1='tap0'
OVS_BRIDGE_PORT_DEVICE_2='tap1'
OVS_BRIDGE_PORT_DEVICE_3='tap2'
OVS_BRIDGE_PORT_DEVICE_4='tap3'
OVS_BRIDGE_PORT_DEVICE_5='tap4'
OVS_BRIDGE_PORT_DEVICE_6='tap5'
OVS_BRIDGE_PORT_DEVICE_8='tap64'
OVS_BRIDGE_PORT_DEVICE_9='tap65'
OVS_BRIDGE_PORT_DEVICE_10='tap66'
OVS_BRIDGE_PORT_DEVICE_11='tap67'
OVS_BRIDGE_PORT_DEVICE_12='tap68'
OVS_BRIDGE_PORT_DEVICE_13='tap69'
OVS_BRIDGE_PORT_DEVICE_15='tap128'
OVS_BRIDGE_PORT_DEVICE_16='tap129'
OVS_BRIDGE_PORT_DEVICE_17='tap130'
OVS_BRIDGE_PORT_DEVICE_18='tap131'
OVS_BRIDGE_PORT_DEVICE_19='tap132'
OVS_BRIDGE_PORT_DEVICE_20='tap133'
openqaworker3:~ #
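
The check done by eye above (is tap00 among the configured bridge ports?) can be sketched as a small script. This is only an illustration: the embedded text is a shortened excerpt of the ifcfg-br1 content quoted above, and the helper name `configured_ports` is made up for this example.

```python
import re

# Shortened excerpt of the ifcfg-br1 content quoted above (illustration only).
IFCFG_BR1 = """\
BOOTPROTO='static'
OVS_BRIDGE='yes'
OVS_BRIDGE_PORT_DEVICE_1='tap0'
OVS_BRIDGE_PORT_DEVICE_2='tap1'
OVS_BRIDGE_PORT_DEVICE_8='tap64'
OVS_BRIDGE_PORT_DEVICE_15='tap128'
"""


def configured_ports(ifcfg_text):
    """Return the device names listed in OVS_BRIDGE_PORT_DEVICE_* entries."""
    pattern = re.compile(r"^OVS_BRIDGE_PORT_DEVICE_\d+='([^']+)'\s*$", re.MULTILINE)
    return set(pattern.findall(ifcfg_text))


ports = configured_ports(IFCFG_BR1)
print("tap00" in ports)  # False: tap00 is not a configured bridge port
print("tap0" in ports)   # True
```

On a real worker the text would be read from /etc/sysconfig/network/ifcfg-br1 instead of the embedded string.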

#2 Updated by okurz over 3 years ago

  • Priority changed from High to Urgent

#4 Updated by okurz about 3 years ago

  • Target version set to Milestone 4

#5 Updated by maritawerner about 3 years ago

  • Assignee set to okurz

#6 Updated by okurz about 3 years ago

  • Related to action #15416: [tools] bridge device seems to have disappeared for HA tests added

#7 Updated by okurz about 3 years ago

#15416 can be related, at least they both fail for the same test suite which is 'hacluster-supportserver'. https://openqa.suse.de/tests?hoursfresh=24&match=hacluster-supportserver shows all recent failures for hacluster-supportserver and we see a lot of incompletes which are apparently either for this ticket here or #15416 (or something else).

To me it looks like jobs on high worker instance numbers fail while jobs on low instance numbers succeed, e.g.:
* passed: https://openqa.suse.de/tests/672601/file/autoinst-log.txt openqaworker3:1, https://openqa.suse.de/tests/657200/file/autoinst-log.txt openqaworker3:5
* incomplete with the mentioned error: https://openqa.suse.de/tests/669351/file/autoinst-log.txt openqaworker3:10, https://openqa.suse.de/tests/667727/file/autoinst-log.txt openqaworker3:11, https://openqa.suse.de/tests/661373/file/autoinst-log.txt openqaworker3:11
* incomplete for other reasons: https://openqa.suse.de/tests/662713/file/autoinst-log.txt openqaworker3:5, https://openqa.suse.de/tests/655123/file/autoinst-log.txt openqaworker3:3

so high numbers fail, low numbers work

There was the commit https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/93cf7448aa56a6bb7d403b88ee401c99d5dd9294 which bumped up the version of tap workers for the use of slenkins but that does not seem to fix it for hacluster-supportserver.
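
To make the "high numbers fail" hypothesis (H1.1) easier to check, here is a rough sketch. It assumes the automatic naming scheme tap<(instance - 1) + 64 * NIC index>, which is consistent with the ports listed in ifcfg-br1 above (tap0-5, tap64-69, tap128-133) but is not confirmed in this ticket. The function name `auto_tap_name` is hypothetical.

```python
# Ports configured on br1 according to the ifcfg-br1 dump in note 1:
# tap0-tap5, tap64-tap69 and tap128-tap133.
CONFIGURED_PORTS = {f"tap{base + i}" for base in (0, 64, 128) for i in range(6)}


def auto_tap_name(worker_instance, nic_index=0):
    """Assumed auto-allocation scheme: tap<(instance - 1) + 64 * nic_index>."""
    return f"tap{worker_instance - 1 + 64 * nic_index}"


# Worker instances taken from the passed and incomplete jobs listed above.
for instance in (1, 5, 10, 11):
    name = auto_tap_name(instance)
    print(instance, name, name in CONFIGURED_PORTS)
```

Under that assumption, instances 1 and 5 map to configured ports (tap0, tap4), while instances 10 and 11 map to tap9 and tap10, which are not in the bridge configuration. This only applies to jobs that rely on auto-allocation, not to jobs with a hardcoded TAPDEV.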

#8 Updated by okurz about 3 years ago

  • Description updated (diff)

#9 Updated by okurz about 3 years ago

  • Assignee changed from okurz to nadvornik

hi @nadvornik, can you help with #14334#note-7 please?

#10 Updated by nadvornik about 3 years ago

  • Assignee changed from nadvornik to dzyuzin

The test is configured to use hardcoded TAP device names instead of the normal auto-allocation: TAPDEV = tap00,tap01,tap02
I am not the author of the test, so I don't know the reason for this configuration.

The number of tap workers increased recently, so I guess the failures are caused by conflicts when multiple instances of the test run in parallel.
This can be fixed by adding a new worker class to this test and to just one worker.

Re-assigning to test maintainer.
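
The suspected conflict can be illustrated with a small sketch: two parallel runs with the same hardcoded TAPDEV value claim the same devices, and the second QEMU to start would get "Device or resource busy". The job names below are hypothetical; only the TAPDEV value comes from the test configuration quoted above.

```python
from collections import defaultdict


def find_tap_conflicts(jobs):
    """Return TAP device names claimed by more than one job, with the claiming jobs."""
    claims = defaultdict(list)
    for job_name, tapdev in jobs.items():
        for dev in tapdev.split(","):
            claims[dev.strip()].append(job_name)
    return {dev: names for dev, names in claims.items() if len(names) > 1}


# Hypothetical job names; both use the hardcoded TAPDEV setting from the test.
jobs = {
    "supportserver-run-a": "tap00,tap01,tap02",
    "supportserver-run-b": "tap00,tap01,tap02",
}
print(find_tap_conflicts(jobs))
```

With auto-allocation each worker instance gets its own device names, so the returned mapping would be empty.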

#11 Updated by okurz about 3 years ago

  • Related to deleted (action #15416: [tools] bridge device seems to have disappeared for HA tests)

#12 Updated by okurz about 3 years ago

  • Blocks action #15416: [tools] bridge device seems to have disappeared for HA tests added

#13 Updated by asmorodskyi about 3 years ago

  • Status changed from New to In Progress
  • Assignee changed from dzyuzin to asmorodskyi

#14 Updated by asmorodskyi about 3 years ago

  • Status changed from In Progress to Resolved

The problem was a misconfiguration of the HA test suites, which started to create devices that had already been created by openvswitch. The solution was to rename the interfaces to tapha00 (adding 'ha' to the device name to make it unique).
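
A minimal sketch of the renaming idea, assuming the devices managed by openvswitch follow the tap<number> pattern seen in the bridge dump above. The helper `ha_name` is hypothetical and is not the code used in the actual fix.

```python
import re

# Pattern of the openvswitch-managed device names seen above (tap0, tap64, ...).
AUTO_TAP = re.compile(r"^tap\d+$")
# Linux interface names are limited to IFNAMSIZ (16) bytes including the trailing NUL.
MAX_IFNAME_LEN = 15


def ha_name(name):
    """Insert 'ha' after the 'tap' prefix, e.g. tap00 -> tapha00."""
    if not name.startswith("tap"):
        raise ValueError(f"not a tap device name: {name}")
    renamed = "tapha" + name[3:]
    if len(renamed) > MAX_IFNAME_LEN:
        raise ValueError(f"interface name too long: {renamed}")
    return renamed


for old in ("tap00", "tap01", "tap02"):
    new = ha_name(old)
    # The renamed device no longer falls into the tap<number> namespace.
    print(old, "->", new, "clashes:", bool(AUTO_TAP.match(new)))
```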

#15 Updated by SLindoMansilla over 2 years ago

  • Description updated (diff)

Workaround updated:

  1. Check whether another instance of the supportserver is running (and using the same tun device).
  2. If that instance is in "zombie mode" (its parallel jobs are incomplete, passed, failed, or restarted, so the supportserver is no longer needed by any job), cancel the zombie instance.
  3. Restart the parallel jobs (the supportserver will be retriggered).

#16 Updated by SLindoMansilla over 2 years ago

  • Related to action #19432: [multimachine][scheduling] Fail of one multi-machine jobs cause restart all of them without checking state of others added
