Project

General

Profile

Actions

action #60161

closed

[network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

Added by okurz about 5 years ago. Updated 22 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Tags:

Description

Observation

openQA test in scenario sle-15-Server-DVD-Updates-x86_64-qam_wicked_aggregate_sut@64bit incompletes after t19_teaming_ab_nsna_ping . From autoinst-log.txt:

[2019-11-21T15:25:19.328 CET] [debug] ||| finished t19_teaming_ab_nsna_ping wicked/aggregate/sut at 2019-11-21 14:25:19 (84 s)
[2019-11-21T15:25:19.330 CET] [debug] Loading a VM snapshot lastgood
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[0]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":600175,"seconds":1574346237}}
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[1]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":657272,"seconds":1574346237}}
[2019-11-21T15:25:19.331 CET] [debug] Loading snapshot (Current VM state is running).
[2019-11-21T15:25:19.337 CET] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":337351,"seconds":1574346319}}
[2019-11-21T15:25:19.338 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-11-21T15:25:21.364 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 40115 (/usr/bin/isotovideo: backen)
[2019-11-21T15:25:21.368 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/hd0-overlay1 /var/lib/openqa/pool/7/raid/hd0-overlay2 32212254720
[2019-11-21T15:25:21.390 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay2', fmt=qcow2 size=32212254720 backing_file=/var/lib/openqa/pool/7/raid/hd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.391 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/cd0-overlay1 /var/lib/openqa/pool/7/raid/cd0-overlay2 665845760
[2019-11-21T15:25:21.410 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay2', fmt=qcow2 size=665845760 backing_file=/var/lib/openqa/pool/7/raid/cd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.411 CET] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap6,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:af -netdev tap,id=qanet1,ifname=tap70,script=no,downscript=no -device virtio-net,netdev=qanet1,mac=52:54:00:12:43:af -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :97,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=virtio_console1,server,nowait,id=virtio_console1,logfile=virtio_console1.log,logappend=on -device virtconsole,chardev=virtio_console1,name=org.openqa.console.virtio_console1 -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/hd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay2,file=hd0-overlay2-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay2,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/cd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay2,file=cd0-overlay2-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay2,serial=cd0 -incoming defer
[2019-11-21T15:25:21.417 CET] [debug] Waiting for 0 attempts
[2019-11-21T15:25:22.418 CET] [debug] Waiting for 1 attempts
[2019-11-21T15:25:23.418 CET] [debug] Finished after 2 attempts
[2019-11-21T15:25:23.419 CET] [debug] qmpsocket 14
[2019-11-21T15:25:23.422 CET] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":422209,"seconds":1574346323}}
[2019-11-21T15:25:24.424 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":426431,"seconds":1574346323}}
[2019-11-21T15:25:26.668 CET] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":668448,"seconds":1574346326}}
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
  'connect_timeout' => 3,
  'hostname' => 'localhost',
  'port' => 5997
}
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console1')
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console')
[2019-11-21T15:25:26.724 CET] [debug] pointer type 1 0 1024 768 -257
[2019-11-21T15:25:26.724 CET] [debug] led state 0 1 1 -261
[2019-11-21T15:25:26.731 CET] [debug] Restored snapshot
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_CONNECTED","timestamp":{"microseconds":669821,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_INITIALIZED","timestamp":{"microseconds":671201,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":731641,"seconds":1574346326}}
[2019-11-21T15:25:26.777 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'sut' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown

so the problem seems to be related to forcing a rollback to a previous snapshot.

Reproducible

Fails sporadically in about 2/3 cases since multiple weeks at least

Expected result

20191120-1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open1 closed)

Related to openQA Tests (public) - action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMIResolvedokurz2018-03-05

Actions
Actions #1

Updated by szarate about 5 years ago

  • Related to action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added
Actions #2

Updated by szarate about 5 years ago

  • Subject changed from "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

I just want to discuss it further, and maybe prepare it for later. So adding to the U team's backlog for the time being.

Actions #3

Updated by szarate about 5 years ago

  • Subject changed from [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
Actions #4

Updated by okurz about 5 years ago

  • Subject changed from [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

that's actually "network"

Actions #5

Updated by okurz about 5 years ago

  • Priority changed from Normal to Urgent

actually "incompletes" should be handled as "Urgent" as also the label carry over would not work without a module mentioned

Actions #6

Updated by asmorodskyi about 5 years ago

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

Actions #7

Updated by okurz about 5 years ago

Maybe but whatever the test does it seems to be very specific so I suggest to avoid that :) There is a "feature request" hidden inside this ticket to make the problem more obvious, I agree :)

Actions #8

Updated by okurz about 5 years ago

  • Subject changed from [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link
Actions #9

Updated by okurz about 5 years ago

  • Subject changed from [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Actions #10

Updated by okurz about 5 years ago

asmorodskyi wrote:

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

https://openqa.suse.de/tests/3697797#next_previous shows that the issue also happens on x86_64 and that it is not always happening, like 9/10 times.

Actions #11

Updated by okurz almost 5 years ago

Latest example: https://openqa.suse.de/tests/3772385/file/autoinst-log.txt

@asmorodskyi I consider this still urgent, on [network], so for you, right?

Actions #12

Updated by okurz almost 5 years ago

  • Subject changed from [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Actions #13

Updated by okurz almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

After asking @asmorodskyi clarified to me that he does not want to look into this. I will see about excluding the test module causing the test module from the test schedule to prevent further incompletes which do not add more information.

Using an SQL query to find out where we use this label lately:

select jobs.id, jobs.test, result_dir from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id and comments.text ~ 'poo#60161';

so we have 15 and 15-SP1 qam jobs in the job groups 165 and 232. I added in the testsuite over https://openqa.suse.de/admin/test_suites

EXCLUDE_MODULES=t20_teaming_ab_all_link

Triggered with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789474: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789474
Created job #3789475: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789475

-> https://openqa.suse.de/tests/3789475#

EDIT: Interesting. All modules passed but job still incompletes with "sut is not responding". Trying to also exclude the now latest test module with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link,t19_teaming_ab_nsna_ping :

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789496: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789496
Created job #3789497: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789497

https://openqa.suse.de/t3789497

Actions #14

Updated by okurz almost 5 years ago

  • Due date set to 2020-01-22
  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint
Actions #15

Updated by okurz almost 5 years ago

  • Due date deleted (2020-01-22)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Urgent to Normal
  • Target version deleted (Current Sprint)

scenario is currently green so the incomplete is averted and another person can now pick up to fix the issue that the two excluded modules triggered.

Actions #16

Updated by okurz almost 5 years ago

seems that product validation tests incomplete now as well as QAM tests again: https://openqa.suse.de/tests/3820231

Seems someone added removed the testsuite setting so I added it back. The comment in the test suite description was still there as well.

Actions #17

Updated by dzedro almost 5 years ago

@okurz when you do something, check if it does what you think it should do!
EXCLUDE_MODULES does not work on qam_wicked_aggregate_ref!

Actions #18

Updated by okurz almost 5 years ago

  • Priority changed from Normal to Urgent

As "EXCLUDE_MODULES" was deleted again from settings we are back with incompletes, putting prio back to Urgent

Actions #19

Updated by dzedro almost 5 years ago

QAM tests on sle15+ are passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping, appears to me that t19_teaming_ab_nsna_ping is problematic

Actions #20

Updated by tjyrinki_suse over 4 years ago

  • Priority changed from Urgent to High
  • Start date deleted (2019-11-22)

Still passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping: https://openqa.suse.de/tests/4287132#next_previous

Maybe priority does not need to be Urgent if this workaround has brought an adequate level of testing, but should be still fixed to work fully.

Actions #21

Updated by cfconrad about 4 years ago

  • Tags set to wicked
Actions #22

Updated by cfconrad about 4 years ago

  • Assignee set to cfconrad
Actions #23

Updated by cfconrad about 4 years ago

Current investigation:

  • the error appear also with a minimal set of test-modules like WICKED_EXCLUDE='^(?!t19_teaming_ab_nsna_ping|t18_teaming_ab_arp_ping)
  • If you set WICKED_TCPDUMP=0 will also avoid the problem
  • Using only one serial console with VIRTIO_CONSOLE_NUM=1, also doesn't show this error.
  • the problem is in serial_terminal::upload_file()! For some reason, if we upload a file from 2Mb we get such error. https://openqa.suse.de/tests/4794708
  • I'm not able to reproduce this error constantly in one of my openQA test environments (cfconrad-vm.qa.suse.de, autobot.qa.suse.de or openqa.wicked.suse.de)
Actions #24

Updated by cfconrad about 4 years ago

Did a manual test of using the second virtio_console and doing some output there. The log file is about ~190MB so there are some characters written. But I don't see the error.
https://openqa.suse.de/tests/4896792

Different things are:

  • single machine job
  • no snapshot loading during each test
Actions #25

Updated by jlausuch about 4 years ago

  • Project changed from openQA Tests (public) to 202
  • Category deleted (Bugs in existing tests)
Actions #26

Updated by okurz about 4 years ago

  • Project changed from 202 to openQA Tests (public)
  • Category set to Bugs in existing tests

guys, please find a better solution than hiding tickets used on public open servers behind a private issue tracker. This is really annoying for outside contributors (and not really supportive of our company values even) and also it failed for the "auto-review" process in https://gitlab.suse.de/openqa/auto-review/pipelines to label the incomplete raising an alert for the team SUSE QE Tools.

Actions #27

Updated by tjyrinki_suse about 4 years ago

  • Subject changed from [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link
Actions #28

Updated by okurz over 3 years ago

  • Priority changed from High to Normal

This ticket was set to "High" priority but was not updated within 120 days which is 4 times the period of the SLO for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives . The ticket will be set to the next lower priority of "Normal".

Actions #29

Updated by cfconrad about 3 years ago

I got a very good reproducer environment so I could restart investigating these things.

1) The socket which pretend to be half open is never one of the consoles it is [[cmdpipe https://github.com/os-autoinst/os-autoinst/blob/master/backend/baseclass.pm#L105]] or [[qemupipe https://github.com/os-autoinst/os-autoinst/blob/master/backend/qemu.pm#L465]]
2) The socket isn't "half-open" as pretended but there is a "bucket" system which check how often read is triggered.

=> This read is just triggered more often then the bucket expect it, see _CHKSEL_RATE_WAIT_TIME and _CHKSEL_RATE_HITS

The issue is very easy to trigger by just running script_output inside a loop for more then 30 seconds.
In a test where I added some debugs to backend/baseclass.pm I got the following statistics:

CB(13) abs:55 1.83333333333333/s avg:1.83
CB(13) abs:39302 1310.06666666667/s avg:655.95
CB(13) abs:54248 1808.26666666667/s avg:1040.06
CB(13) abs:54544 1818.13333333333/s avg:1234.57
CB(13) abs:54356 1811.86666666667/s avg:1350.03
CB(13) abs:54516 1817.2/s avg:1427.89
CB(13) abs:54350 1811.66666666667/s avg:1482.72
CB(13) abs:54072 1802.4/s avg:1522.68
CB(13) abs:54466 1815.53333333333/s avg:1555.22
CB(13) abs:54936 1831.2/s avg:1582.82
CB(13) abs:54558 1818.6/s avg:1604.25

The abs value is the absolute value within 30s, the value in the () is the fd fileno, the value xxxx/s is the value per seconds and the avg value is the avarage.
So we see that we hit the threashold of 15k very easy with this simple test:

use Mojo::Base 'opensusebasetest';
use testapi;

sub run {
    my ($self, $ctx) = @_;
    $self->select_serial_terminal;

    my $start_time = time;
    my $duration = get_var('CLEMIX_DURATION', 300);

    while (time - $start_time < $duration) {
        script_output("true\n", quiet => 1);
    }
}

Trigger this test with:

_CHKSEL_RATE_HITS=${1:-15000}
openqa-cli api --host http://openqa-3.wicked.suse.de -X POST jobs \
    'ARCH=x86_64' 'BACKEND=qemu' 'BOOT_HDD_IMAGE=1' 'BUILD=baseclass-14_1' \
    'CASEDIR=https://github.com/cfconrad/os-autoinst-distri-opensuse#stress_cmdpipe_socket' \
    'DESKTOP=textmode' 'DISTRI=opensuse' 'FLAVOR=stress_cmdpipe' 'HDDSIZEGB=30' \
    'HDD_1=tumbleweed.qcow2' 'KEEP_GRUB_TIMEOUT=1' 'MACHINE=64bit' \
    'NICTYPE=tap' 'NOIMAGES=1' 'NOVIDEO=1' 'OVS_DEBUG=1' \
    'QEMUCPU=qemu64' 'QEMUVGA=cirrus' 'TAPDEV=auto,auto' \
    'UEFI_PFLASH_VARS=tumbleweed-uefi-vars.qcow2' 'VERSION=Tumbleweed' \
    'VIDEOMODE=text' 'VIRTIO_CONSOLE=1' 'VIRTIO_CONSOLE_NUM=2' \
    'WORKER_CLASS=openqa-3' '_GROUP_ID=18' '_QUIET_SCRIPT_CALLS=1' \
    SCHEDULE='tests/boot/boot_to_desktop,tests/clemix/stress_cmdpipe_socket' \
    TEST=clemix_stress_cmdpipe_socket \
    _CHKSEL_RATE_HITS=$_CHKSEL_RATE_HITS \
    | openqa-wait-job --host http://openqa-3.wicked.suse.de

http://openqa-3.wicked.suse.de/tests/31388 with _CHKSEL_RATE_HITS=15000 => FAILED
http://openqa-3.wicked.suse.de/tests/31387 with _CHKSEL_RATE_HITS=100000 => SUCCESSFUL

Actions #30

Updated by slo-gin about 2 years ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #31

Updated by slo-gin 9 months ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #32

Updated by cfconrad 22 days ago

  • Status changed from Workable to Resolved

Increase the number of _CHKSEL_RATE_HITS is the answer...

Actions

Also available in: Atom PDF