action #14334: job incomplete: "could not configure /dev/net/tun (tap00): Device or resource busy" - openQA Tests (public) - openSUSE Project Management Tool

action #14334

closed

job incomplete: "could not configure /dev/net/tun (tap00): Device or resource busy"

Added by okurz over 8 years ago. Updated almost 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

asmorodskyi

Category:

Bugs in existing tests

Target version:

openQA Project (public) - Milestone 4

Start date:

2016-10-20

Due date:

% Done:

Estimated time:

Difficulty:

Description

observation

Job t#621407 is incomplete. https://openqa.suse.de/tests/621407/file/autoinst-log.txt shows the error details:

DIE Died at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
	backend::baseclass::die_handler('Died at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76.\x{a}') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 76
	consoles::vnc_base::catch {...} ('Error connecting to host <localhost>\x{a}$VAR1 = bless( {\x{a}       ...') called at /usr/lib/perl5/vendor_perl/5.18.2/Try/Tiny.pm line 104
	Try::Tiny::try('CODE(0x61b1c78)', 'Try::Tiny::Catch=REF(0x627a2b8)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 80
	consoles::vnc_base::connect_vnc('consoles::vnc_base=HASH(0x61b9a20)', 'HASH(0x5f79320)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 37
	consoles::vnc_base::activate('consoles::vnc_base=HASH(0x61b9a20)') called at /usr/lib/os-autoinst/consoles/console.pm line 74
	consoles::console::select('consoles::vnc_base=HASH(0x61b9a20)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 469
	backend::baseclass::select_console('backend::qemu=HASH(0x60c04b0)', 'HASH(0x61b99a8)') called at /usr/lib/os-autoinst/backend/qemu.pm line 679
	backend::qemu::start_qemu('backend::qemu=HASH(0x60c04b0)') called at /usr/lib/os-autoinst/backend/qemu.pm line 98
	backend::qemu::do_start_vm('backend::qemu=HASH(0x60c04b0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 255
	backend::baseclass::start_vm('backend::qemu=HASH(0x60c04b0)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 68
	backend::baseclass::handle_command('backend::qemu=HASH(0x60c04b0)', 'HASH(0x6161500)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 427
	backend::baseclass::check_socket('backend::qemu=HASH(0x60c04b0)', 'IO::Handle=GLOB(0x608c5b8)') called at /usr/lib/os-autoinst/backend/qemu.pm line 893
	backend::qemu::check_socket('backend::qemu=HASH(0x60c04b0)', 'IO::Handle=GLOB(0x608c5b8)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 209
	eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 171
	backend::baseclass::run_capture_loop('backend::qemu=HASH(0x60c04b0)', 'IO::Select=ARRAY(0x60095e8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 120
	backend::baseclass::run('backend::qemu=HASH(0x60c04b0)', 6, 10) called at /usr/lib/os-autoinst/backend/driver.pm line 85
	backend::driver::start('backend::driver=HASH(0x3fbc450)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
	backend::driver::new('backend::driver', 'qemu') called at /usr/bin/isotovideo line 177
	main::init_backend() called at /usr/bin/isotovideo line 236
05:44:49.2515 13306 waitpid for 13314 returned 13314
05:44:49.2519 13306 QEMU: qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap00,script=/etc/qemu-ifup-br0,downscript=no: could not configure /dev/net/tun (tap00): Device or resource busy
05:44:49.2520 13306 QEMU: qemu-system-x86_64: -netdev tap,id=qanet0,ifname=tap00,script=/etc/qemu-ifup-br0,downscript=no: Device 'tap' could not be initialized

reproducible

Looking at https://openqa.suse.de/tests?hoursfresh=24&match=hacluster-supportserver shows that this happens often, but not on every run.

problem

H1. problem with setting up the tun device
H1.1. tun device fails for high instance worker numbers (see #14334#note-7)
H2. conflict with other jobs accessing the same devices at the same time

workaround

A restart seems to help.

Check whether another instance of the supportserver is running (and is therefore using the same tun device).
If that instance is in "zombie mode" (its parallel jobs are incomplete, passed, failed, or parallel restarted, so the supportserver is no longer needed by any job):
Cancel this zombie instance.
Restart the parallel jobs (the supportserver will be retriggered).
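
The zombie check described above can be sketched in code. This is only an illustration under stated assumptions: the job records, their field names (`id`, `test`, `state`, `result`, `parallel_ids`) and the helper `find_zombie_supportservers` are hypothetical and are not the openQA API.

```python
# Hypothetical job records; the field names are illustrative, not the openQA API.
FINISHED_RESULTS = {"incomplete", "passed", "failed", "parallel_restarted"}


def find_zombie_supportservers(jobs):
    """Return ids of running supportserver jobs whose parallel jobs have all finished."""
    by_id = {job["id"]: job for job in jobs}
    zombies = []
    for job in jobs:
        if job["state"] != "running" or "supportserver" not in job["test"]:
            continue
        # Only consider parallel jobs that are present in the given list.
        parallels = [by_id[pid] for pid in job["parallel_ids"] if pid in by_id]
        if parallels and all(p["result"] in FINISHED_RESULTS for p in parallels):
            zombies.append(job["id"])
    return zombies
```

A supportserver with no known parallel jobs is deliberately not reported, since nothing can be concluded about it from the list alone.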

Related issues 2 (0 open — 2 closed)

Updated by dzedro over 8 years ago

There is no tap00: the device does not exist and is not listed in /etc/sysconfig/network/ifcfg-br1.
The question is how the worker got tap00.
I restarted worker :3.

openqaworker3:~ # ovs-ofctl show br1
OFPT_FEATURES_REPLY (xid=0x2): dpid:00003ef9f690fd43
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: OUTPUT SET_VLAN_VID SET_VLAN_PCP STRIP_VLAN SET_DL_SRC SET_DL_DST SET_NW_SRC SET_NW_DST SET_NW_TOS SET_TP_SRC SET_TP_DST ENQUEUE
 1(tap132): addr:12:2c:6a:ef:1d:64
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 2(tap130): addr:b2:ee:80:5c:79:c2
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 3(tap64): addr:d2:06:4c:bf:72:6d
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 4(tap0): addr:f2:43:0e:5a:21:c0
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 5(tap4): addr:d6:38:c0:8d:27:ce
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 6(tap67): addr:7e:c7:39:24:a0:7c
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 7(tap1): addr:6a:73:2a:e4:fc:ec
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 8(tap65): addr:b6:00:58:79:0a:4f
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 9(tap131): addr:b2:77:86:ff:c9:85
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 10(tap66): addr:82:f6:ad:26:ad:d4
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 11(tap128): addr:4a:19:2e:f7:2e:73
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 12(tap133): addr:c6:2c:11:1c:d8:5c
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 13(tap5): addr:26:77:d5:0e:7e:af
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 14(tap68): addr:72:5c:0a:06:d1:a1
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 15(tap129): addr:76:cb:d9:59:b8:ba
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 16(tap69): addr:0a:ca:86:c6:6d:54
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 17(tap3): addr:6a:62:87:8e:75:3d
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 18(tap2): addr:56:1e:3f:3b:29:40
     config:     0
     state:      LINK_DOWN
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 LOCAL(br1): addr:3e:f9:f6:90:fd:43
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
openqaworker3:~ #

openqaworker3:~ # cat /etc/sysconfig/network/ifcfg-br1 
BOOTPROTO='static'
IPADDR='10.0.2.2/15'
STARTMODE='auto'
OVS_BRIDGE='yes'
OVS_BRIDGE_PORT_DEVICE_1='tap0'
OVS_BRIDGE_PORT_DEVICE_2='tap1'
OVS_BRIDGE_PORT_DEVICE_3='tap2'
OVS_BRIDGE_PORT_DEVICE_4='tap3'
OVS_BRIDGE_PORT_DEVICE_5='tap4'
OVS_BRIDGE_PORT_DEVICE_6='tap5'
OVS_BRIDGE_PORT_DEVICE_8='tap64'
OVS_BRIDGE_PORT_DEVICE_9='tap65'
OVS_BRIDGE_PORT_DEVICE_10='tap66'
OVS_BRIDGE_PORT_DEVICE_11='tap67'
OVS_BRIDGE_PORT_DEVICE_12='tap68'
OVS_BRIDGE_PORT_DEVICE_13='tap69'
OVS_BRIDGE_PORT_DEVICE_15='tap128'
OVS_BRIDGE_PORT_DEVICE_16='tap129'
OVS_BRIDGE_PORT_DEVICE_17='tap130'
OVS_BRIDGE_PORT_DEVICE_18='tap131'
OVS_BRIDGE_PORT_DEVICE_19='tap132'
OVS_BRIDGE_PORT_DEVICE_20='tap133'
openqaworker3:~ #
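
The manual comparison above (is the requested device among the bridge ports?) can be sketched as a small parser over the two outputs. The function names `ifcfg_ports`, `ofctl_ports` and `report` are made up for this illustration, and the regular expressions only assume the line shapes visible in the listings above.

```python
import re


def ifcfg_ports(ifcfg_text):
    """Collect device names from OVS_BRIDGE_PORT_DEVICE_<n>='<name>' lines."""
    pattern = re.compile(
        r"^OVS_BRIDGE_PORT_DEVICE_\d+=['\"]?([^'\"\s]+)['\"]?\s*$", re.M
    )
    return set(pattern.findall(ifcfg_text))


def ofctl_ports(ofctl_text):
    """Collect port names from lines like ' 4(tap0): addr:f2:43:...'.

    The LOCAL(br1) entry is skipped because it has no numeric port id.
    """
    pattern = re.compile(r"^\s*\d+\(([^)]+)\):\s+addr:", re.M)
    return set(pattern.findall(ofctl_text))


def report(requested, ifcfg_text, ofctl_text):
    """Say whether the requested device is configured and attached to the bridge."""
    return {
        "configured": requested in ifcfg_ports(ifcfg_text),
        "attached": requested in ofctl_ports(ofctl_text),
    }
```

Fed with the two listings above and the name `tap00`, both flags would come out false, which matches the observation that tap00 is in neither place.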

Updated by okurz over 8 years ago

Priority changed from High to Urgent

recent example: https://openqa.suse.de/tests/637275/file/autoinst-log.txt

Updated by dzedro over 8 years ago

latest hacluster-supportserver

Updated by okurz over 8 years ago

Target version set to Milestone 4

Updated by maritawerner over 8 years ago

Assignee set to okurz

Updated by okurz over 8 years ago

Related to action #15416: [tools] bridge device seems to have disappeared for HA tests added

Updated by okurz over 8 years ago

#15416 may be related; at least both fail for the same test suite, 'hacluster-supportserver'. https://openqa.suse.de/tests?hoursfresh=24&match=hacluster-supportserver shows all recent failures for hacluster-supportserver, and we see a lot of incompletes which apparently belong either to this ticket or to #15416 (or to something else).

To me it looks like jobs on high worker instance numbers fail while jobs on low instance numbers succeed, e.g.:

passed: https://openqa.suse.de/tests/672601/file/autoinst-log.txt openqaworker3:1, https://openqa.suse.de/tests/657200/file/autoinst-log.txt openqaworker3:5
incomplete with the mentioned error: https://openqa.suse.de/tests/669351/file/autoinst-log.txt openqaworker3:10, https://openqa.suse.de/tests/667727/file/autoinst-log.txt openqaworker3:11, https://openqa.suse.de/tests/661373/file/autoinst-log.txt openqaworker3:11
other reason incompletes: https://openqa.suse.de/tests/662713/file/autoinst-log.txt openqaworker3:5, https://openqa.suse.de/tests/655123/file/autoinst-log.txt openqaworker3:3

so high numbers fail, low numbers work

There was the commit https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/93cf7448aa56a6bb7d403b88ee401c99d5dd9294 which bumped up the version of tap workers for the use of slenkins but that does not seem to fix it for hacluster-supportserver.
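
To make hypothesis H1.1 concrete, here is a minimal sketch. It assumes the usual instance-based naming in which worker instance N uses tap(N-1), tap(N-1+64) and tap(N-1+128); that convention is an assumption here, not something stated in this ticket, although the ifcfg-br1 listing quoted earlier (tap0..tap5, tap64..tap69, tap128..tap133) is consistent with it. The function names are hypothetical. It only applies to auto-allocated devices, not to hardcoded names such as tap00.

```python
def tap_devices_for_instance(instance, nics=3, stride=64):
    """Assumed convention: instance N gets tap(N-1), tap(N-1+64), tap(N-1+128)."""
    return [f"tap{instance - 1 + i * stride}" for i in range(nics)]


def instances_without_devices(instances, configured):
    """Return the worker instances for which at least one expected tap device is missing."""
    configured = set(configured)
    return [
        n for n in instances
        if not set(tap_devices_for_instance(n)) <= configured
    ]
```

With only six devices per group configured, instances 10 and 11 would have no matching devices under this assumption, while instances 1 and 5 would.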

Updated by okurz over 8 years ago

Description updated (diff)

Updated by okurz over 8 years ago

Assignee changed from okurz to nadvornik

hi @nadvornik, can you help with #14334#note-7 please?

#10

Updated by nadvornik over 8 years ago

Assignee changed from nadvornik to dzyuzin

The test is configured to use hardcoded TAP device names instead of the normal auto-allocation: TAPDEV = tap00,tap01,tap02
I am not the author of the test, so I don't know the reason for this configuration.

The number of tap workers increased recently, so I guess the failures are caused by conflicts when multiple instances of the test run in parallel.
This can be fixed by adding a new worker class to this test and to just one worker.

Re-assigning to test maintainer.
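
The suspected conflict can be illustrated with a short sketch: if two jobs that run at the same time both carry the same hardcoded TAPDEV value, every name in it is claimed twice. The mapping of job id to TAPDEV string and the function name `tapdev_conflicts` are assumptions made for this example.

```python
from collections import defaultdict


def tapdev_conflicts(running_jobs):
    """Map each device name to the job ids claiming it, keeping only contested names.

    running_jobs: dict of job id -> comma-separated TAPDEV string (hypothetical shape).
    """
    claims = defaultdict(list)
    for job_id, tapdev in running_jobs.items():
        for name in tapdev.split(","):
            name = name.strip()
            if name:
                claims[name].append(job_id)
    return {name: ids for name, ids in claims.items() if len(ids) > 1}
```

Restricting the test to a single worker through a dedicated worker class, as suggested above, avoids the clash by never letting two such jobs run side by side.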

#11

Updated by okurz over 8 years ago

Related to deleted (action #15416: [tools] bridge device seems to have disappeared for HA tests)

#12

Updated by okurz over 8 years ago

Blocks action #15416: [tools] bridge device seems to have disappeared for HA tests added

#13

Updated by asmorodskyi over 8 years ago

Status changed from New to In Progress
Assignee changed from dzyuzin to asmorodskyi

#14

Updated by asmorodskyi over 8 years ago

Status changed from In Progress to Resolved

The problem was a misconfiguration of the HA test suites, which started creating devices that had already been created by openvswitch. The solution was to rename the interfaces to tapha00 (adding 'ha' to the device name to make it unique).
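
A minimal sketch of the renaming idea, assuming the TAPDEV value is a comma-separated list and that a set of names already owned by the bridge is available. The helper names `add_ha_infix` and `rename_tapdev` are hypothetical; the actual change was made in the test suite settings, not through code like this.

```python
import re


def add_ha_infix(name):
    """Turn 'tap00' into 'tapha00'; names of any other shape are left unchanged."""
    return re.sub(r"^tap(\d+)$", r"tapha\1", name)


def rename_tapdev(tapdev, reserved):
    """Rename every device in a comma-separated TAPDEV value and check uniqueness.

    reserved: names already created elsewhere, for example by the bridge setup.
    """
    renamed = [add_ha_infix(name.strip()) for name in tapdev.split(",")]
    clashes = sorted(set(renamed) & set(reserved))
    if clashes:
        raise ValueError(f"device names still clash: {clashes}")
    return ",".join(renamed)
```

For example, `rename_tapdev("tap00,tap01,tap02", {"tap0", "tap1"})` yields `tapha00,tapha01,tapha02`.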

#15

Updated by SLindoMansilla almost 8 years ago

Description updated (diff)

Workaround updated:

Check whether another instance of the supportserver is running (and is therefore using the same tun device).
If that instance is in "zombie mode" (its parallel jobs are incomplete, passed, failed, or parallel restarted, so the supportserver is no longer needed by any job):
Cancel this zombie instance.
Restart the parallel jobs (the supportserver will be retriggered).

#16

Updated by SLindoMansilla almost 8 years ago

Related to action #19432: [multimachine][scheduling] Fail of one multi-machine jobs cause restart all of them without checking state of others added
