action #60161 (closed)
[network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Description
Observation
openQA test in scenario sle-15-Server-DVD-Updates-x86_64-qam_wicked_aggregate_sut@64bit incompletes after t19_teaming_ab_nsna_ping. From autoinst-log.txt:
[2019-11-21T15:25:19.328 CET] [debug] ||| finished t19_teaming_ab_nsna_ping wicked/aggregate/sut at 2019-11-21 14:25:19 (84 s)
[2019-11-21T15:25:19.330 CET] [debug] Loading a VM snapshot lastgood
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[0]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":600175,"seconds":1574346237}}
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[1]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":657272,"seconds":1574346237}}
[2019-11-21T15:25:19.331 CET] [debug] Loading snapshot (Current VM state is running).
[2019-11-21T15:25:19.337 CET] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":337351,"seconds":1574346319}}
[2019-11-21T15:25:19.338 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-11-21T15:25:21.364 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 40115 (/usr/bin/isotovideo: backen)
[2019-11-21T15:25:21.368 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/hd0-overlay1 /var/lib/openqa/pool/7/raid/hd0-overlay2 32212254720
[2019-11-21T15:25:21.390 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay2', fmt=qcow2 size=32212254720 backing_file=/var/lib/openqa/pool/7/raid/hd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.391 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/cd0-overlay1 /var/lib/openqa/pool/7/raid/cd0-overlay2 665845760
[2019-11-21T15:25:21.410 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay2', fmt=qcow2 size=665845760 backing_file=/var/lib/openqa/pool/7/raid/cd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.411 CET] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap6,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:af -netdev tap,id=qanet1,ifname=tap70,script=no,downscript=no -device virtio-net,netdev=qanet1,mac=52:54:00:12:43:af -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :97,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=virtio_console1,server,nowait,id=virtio_console1,logfile=virtio_console1.log,logappend=on -device virtconsole,chardev=virtio_console1,name=org.openqa.console.virtio_console1 -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/hd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay2,file=hd0-overlay2-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay2,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/cd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay2,file=cd0-overlay2-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay2,serial=cd0 -incoming defer
[2019-11-21T15:25:21.417 CET] [debug] Waiting for 0 attempts
[2019-11-21T15:25:22.418 CET] [debug] Waiting for 1 attempts
[2019-11-21T15:25:23.418 CET] [debug] Finished after 2 attempts
[2019-11-21T15:25:23.419 CET] [debug] qmpsocket 14
[2019-11-21T15:25:23.422 CET] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":422209,"seconds":1574346323}}
[2019-11-21T15:25:24.424 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":426431,"seconds":1574346323}}
[2019-11-21T15:25:26.668 CET] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":668448,"seconds":1574346326}}
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
'connect_timeout' => 3,
'hostname' => 'localhost',
'port' => 5997
}
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console1')
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console')
[2019-11-21T15:25:26.724 CET] [debug] pointer type 1 0 1024 768 -257
[2019-11-21T15:25:26.724 CET] [debug] led state 0 1 1 -261
[2019-11-21T15:25:26.731 CET] [debug] Restored snapshot
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_CONNECTED","timestamp":{"microseconds":669821,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_INITIALIZED","timestamp":{"microseconds":671201,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":731641,"seconds":1574346326}}
[2019-11-21T15:25:26.777 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'sut' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown
So the problem seems to be related to forcing a rollback to a previous snapshot.
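For reference, the hint in the error message refers to disabling VNC stall detection around an expected disconnect. That is not the fix for this ticket (here the disconnect after the snapshot rollback is not intended), but a minimal sketch of what the hint means could look like the following; the reboot command and the surrounding module are only illustrative:

use Mojo::Base 'opensusebasetest';
use testapi;

sub run {
    # Hypothetical example: before an intended shutdown/reboot of the SUT,
    # tell the backend not to treat the upcoming VNC silence as a stall.
    console('sut')->disable_vnc_stalls;
    type_string("systemctl reboot\n");
    # ... wait for the system to come back before continuing
}

1;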
Reproducible
Fails sporadically, in about 2 out of 3 cases, for at least several weeks.
Expected result
Last good: 20191120-1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by szarate about 5 years ago
- Related to action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added
Updated by szarate about 5 years ago
- Subject changed from "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
I just want to discuss it further, and maybe prepare it for later. So adding to the U team's backlog for the time being.
Updated by szarate about 5 years ago
- Subject changed from [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
Updated by okurz about 5 years ago
- Subject changed from [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
That's actually "network".
Updated by okurz about 5 years ago
- Priority changed from Normal to Urgent
actually "incompletes" should be handled as "Urgent" as also the label carry over would not work without a module mentioned
Updated by asmorodskyi about 5 years ago
Failing at PowerKVM job https://openqa.suse.de/tests/3687312. Exactly the same code runs fine on x86 and aarch64. I would say that the tags [network] and [qam] are misleading here; the issue is not related to network or QAM. Something is broken in the revert-to-snapshot mechanism.
Updated by okurz about 5 years ago
Maybe, but whatever the test does seems to be very specific, so I suggest avoiding that :) There is a "feature request" hidden inside this ticket to make the problem more obvious, I agree :)
Updated by okurz about 5 years ago
- Subject changed from [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
Updated by okurz about 5 years ago
- Subject changed from [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Updated by okurz about 5 years ago
asmorodskyi wrote:
Failing at PowerKVM job https://openqa.suse.de/tests/3687312. Exactly the same code runs fine on x86 and aarch64. I would say that the tags [network] and [qam] are misleading here; the issue is not related to network or QAM. Something is broken in the revert-to-snapshot mechanism.
https://openqa.suse.de/tests/3697797#next_previous shows that the issue also happens on x86_64 and that it does not happen every time, but in roughly 9 out of 10 runs.
Updated by okurz almost 5 years ago
Latest example: https://openqa.suse.de/tests/3772385/file/autoinst-log.txt
@asmorodskyi I consider this still urgent, on [network], so for you, right?
Updated by okurz almost 5 years ago
- Subject changed from [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Updated by okurz almost 5 years ago
- Status changed from New to In Progress
- Assignee set to okurz
After asking, @asmorodskyi clarified to me that he does not want to look into this. I will see about excluding the test module causing the problem from the test schedule to prevent further incompletes, which do not add more information.
Using an SQL query to find out where we use this label lately:
select jobs.id, jobs.test, result_dir from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id and comments.text ~ 'poo#60161';
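For reference, a hypothetical way to run this query on an openQA host, assuming the default database setup (database "openqa" owned by the "geekotest" user):

# Hypothetical invocation; adjust database name/user if your setup differs
sudo -u geekotest psql openqa <<'SQL'
select jobs.id, jobs.test, result_dir from jobs, comments
  where (result='incomplete' and t_finished >= (NOW() - interval '240 hour')
  and jobs.id in (select job_id from comments where id is not null))
  and jobs.id = job_id and comments.text ~ 'poo#60161';
SQL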
So we have 15 and 15-SP1 QAM jobs in the job groups 165 and 232. I added the following to the test suite over https://openqa.suse.de/admin/test_suites:
EXCLUDE_MODULES=t20_teaming_ab_all_link
Triggered with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link
Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789474: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789474
Created job #3789475: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789475
-> https://openqa.suse.de/tests/3789475#
EDIT: Interesting. All modules passed but the job still incompletes with "sut is not responding". Trying to also exclude the now latest test module with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link,t19_teaming_ab_nsna_ping:
Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789496: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789496
Created job #3789497: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789497
Updated by okurz almost 5 years ago
- Due date set to 2020-01-22
- Status changed from In Progress to Feedback
- Target version set to Current Sprint
https://openqa.suse.de/t3789497 -> passed. Updated https://openqa.suse.de/admin/test_suites and monitoring.
Updated by okurz almost 5 years ago
- Due date deleted (2020-01-22)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Priority changed from Urgent to Normal
- Target version deleted (Current Sprint)
The scenario is currently green, so the incomplete is averted, and another person can now pick this up to fix the issue that the two excluded modules triggered.
Updated by okurz almost 5 years ago
Seems that product validation tests incomplete now as well, and QAM tests again: https://openqa.suse.de/tests/3820231
Seems someone removed the test suite setting again, so I added it back. The comment in the test suite description was still there as well.
Updated by dzedro almost 5 years ago
@okurz when you do something, check if it does what you think it should do!
EXCLUDE_MODULES does not work on qam_wicked_aggregate_ref!
Updated by okurz almost 5 years ago
- Priority changed from Normal to Urgent
As "EXCLUDE_MODULES" was deleted again from settings we are back with incompletes, putting prio back to Urgent
Updated by dzedro almost 5 years ago
QAM tests on sle15+ are passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping; it appears to me that t19_teaming_ab_nsna_ping is problematic.
Updated by tjyrinki_suse over 4 years ago
- Priority changed from Urgent to High
- Start date deleted (2019-11-22)
Still passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping: https://openqa.suse.de/tests/4287132#next_previous
Maybe the priority does not need to be Urgent if this workaround has brought an adequate level of testing, but it should still be fixed to work fully.
Updated by cfconrad about 4 years ago
Current investigation:
- The error also appears with a minimal set of test modules like WICKED_EXCLUDE='^(?!t19_teaming_ab_nsna_ping|t18_teaming_ab_arp_ping)'
- Setting WICKED_TCPDUMP=0 will also avoid the problem.
- Using only one serial console with VIRTIO_CONSOLE_NUM=1 also doesn't show this error.
- The problem is in serial_terminal::upload_file()! For some reason, if we upload a file of 2 MB we get such an error: https://openqa.suse.de/tests/4794708
- I'm not able to reproduce this error consistently in any of my openQA test environments (cfconrad-vm.qa.suse.de, autobot.qa.suse.de or openqa.wicked.suse.de)
Updated by cfconrad about 4 years ago
Did a manual test of using the second virtio_console and doing some output there. The log file is about 190 MB, so quite some characters were written. But I don't see the error.
https://openqa.suse.de/tests/4896792
Different things are:
- single machine job
- no snapshot loading during each test
Updated by jlausuch about 4 years ago
- Project changed from openQA Tests (public) to 202
- Category deleted (Bugs in existing tests)
Updated by okurz about 4 years ago
- Project changed from 202 to openQA Tests (public)
- Category set to Bugs in existing tests
Guys, please find a better solution than hiding tickets referenced on public, open servers behind a private issue tracker. This is really annoying for outside contributors (and not really supportive of our company values, even), and it also made the "auto-review" process in https://gitlab.suse.de/openqa/auto-review/pipelines fail to label the incomplete, raising an alert for the SUSE QE Tools team.
Updated by tjyrinki_suse about 4 years ago
- Subject changed from [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Updated by okurz over 3 years ago
- Priority changed from High to Normal
This ticket was set to "High" priority but was not updated within 120 days which is 4 times the period of the SLO for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives . The ticket will be set to the next lower priority of "Normal".
Updated by cfconrad about 3 years ago
I got a very good reproducer environment so I could restart investigating these things.
1) The socket which pretends to be half-open is never one of the consoles; it is [[cmdpipe https://github.com/os-autoinst/os-autoinst/blob/master/backend/baseclass.pm#L105]] or [[qemupipe https://github.com/os-autoinst/os-autoinst/blob/master/backend/qemu.pm#L465]].
2) The socket isn't "half-open" as claimed; rather, there is a "bucket" system which checks how often read is triggered on it.
=> This read is simply triggered more often than the bucket expects, see _CHKSEL_RATE_WAIT_TIME and _CHKSEL_RATE_HITS.
The issue is very easy to trigger by just running script_output inside a loop for more than 30 seconds.
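To illustrate that mechanism, a minimal sketch only (the function name and structure are made up, this is not the actual os-autoinst code): count read wake-ups per file descriptor within a time window and report a stall once a threshold is exceeded.

use strict;
use warnings;

# Sketch of the "bucket" idea, using the 30 s window and 15000 hits mentioned above
my %hits;
my $window_start = time;

sub hit_select_rate_limit {
    my ($fd, $wait_time, $max_hits) = @_;
    if (time - $window_start > $wait_time) {
        # start a new measurement window
        %hits         = ();
        $window_start = time;
    }
    # count this read wake-up; report a stall once the bucket overflows
    return ++$hits{$fd} > $max_hits;
}

# e.g. called from the select() loop for every readable file descriptor:
# die 'The console is not responding (half-open socket?)'
#     if hit_select_rate_limit(fileno($socket), 30, 15000);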
In a test where I added some debug output to backend/baseclass.pm I got the following statistics:
CB(13) abs:55 1.83333333333333/s avg:1.83
CB(13) abs:39302 1310.06666666667/s avg:655.95
CB(13) abs:54248 1808.26666666667/s avg:1040.06
CB(13) abs:54544 1818.13333333333/s avg:1234.57
CB(13) abs:54356 1811.86666666667/s avg:1350.03
CB(13) abs:54516 1817.2/s avg:1427.89
CB(13) abs:54350 1811.66666666667/s avg:1482.72
CB(13) abs:54072 1802.4/s avg:1522.68
CB(13) abs:54466 1815.53333333333/s avg:1555.22
CB(13) abs:54936 1831.2/s avg:1582.82
CB(13) abs:54558 1818.6/s avg:1604.25
The abs value is the absolute number of read wake-ups within 30 s, the value in the () is the fd fileno, the xxxx/s value is the rate per second, and the avg value is the running average of those per-second rates (e.g. (1.83 + 1310.07) / 2 ≈ 655.95 after the second window).
So we see that we hit the threshold of 15k (about 500 reads/s over the 30 s window) very easily with this simple test:
use Mojo::Base 'opensusebasetest';
use testapi;

sub run {
    my ($self, $ctx) = @_;
    $self->select_serial_terminal;
    my $start_time = time;
    my $duration   = get_var('CLEMIX_DURATION', 300);
    # hammer the serial terminal with short commands for $duration seconds
    while (time - $start_time < $duration) {
        script_output("true\n", quiet => 1);
    }
}

1;
Trigger this test with:
_CHKSEL_RATE_HITS=${1:-15000}
openqa-cli api --host http://openqa-3.wicked.suse.de -X POST jobs \
'ARCH=x86_64' 'BACKEND=qemu' 'BOOT_HDD_IMAGE=1' 'BUILD=baseclass-14_1' \
'CASEDIR=https://github.com/cfconrad/os-autoinst-distri-opensuse#stress_cmdpipe_socket' \
'DESKTOP=textmode' 'DISTRI=opensuse' 'FLAVOR=stress_cmdpipe' 'HDDSIZEGB=30' \
'HDD_1=tumbleweed.qcow2' 'KEEP_GRUB_TIMEOUT=1' 'MACHINE=64bit' \
'NICTYPE=tap' 'NOIMAGES=1' 'NOVIDEO=1' 'OVS_DEBUG=1' \
'QEMUCPU=qemu64' 'QEMUVGA=cirrus' 'TAPDEV=auto,auto' \
'UEFI_PFLASH_VARS=tumbleweed-uefi-vars.qcow2' 'VERSION=Tumbleweed' \
'VIDEOMODE=text' 'VIRTIO_CONSOLE=1' 'VIRTIO_CONSOLE_NUM=2' \
'WORKER_CLASS=openqa-3' '_GROUP_ID=18' '_QUIET_SCRIPT_CALLS=1' \
SCHEDULE='tests/boot/boot_to_desktop,tests/clemix/stress_cmdpipe_socket' \
TEST=clemix_stress_cmdpipe_socket \
_CHKSEL_RATE_HITS=$_CHKSEL_RATE_HITS \
| openqa-wait-job --host http://openqa-3.wicked.suse.de
http://openqa-3.wicked.suse.de/tests/31388 with _CHKSEL_RATE_HITS=15000 => FAILED
http://openqa-3.wicked.suse.de/tests/31387 with _CHKSEL_RATE_HITS=100000 => SUCCESSFUL
Updated by slo-gin about 2 years ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
Updated by slo-gin 10 months ago
This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
Updated by cfconrad about 1 month ago
- Status changed from Workable to Resolved
Increasing the number of _CHKSEL_RATE_HITS is the answer...
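For the record, a hypothetical way to apply that to a single affected job, mirroring the openqa-clone-job invocations used earlier in this ticket (the job id is a placeholder):

# Hypothetical example: re-run an affected job with a raised threshold
openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps <JOB_ID> \
    _CHKSEL_RATE_HITS=100000

Presumably the same variable can also be set permanently in the test suite settings, as was done for EXCLUDE_MODULES above.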