Project

General

Profile

action #60161

[network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

Added by okurz almost 2 years ago. Updated about 2 months ago.

Status:
Workable
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Tags:

Description

Observation

openQA test in scenario sle-15-Server-DVD-Updates-x86_64-qam_wicked_aggregate_sut@64bit incompletes after t19_teaming_ab_nsna_ping . From autoinst-log.txt:

[2019-11-21T15:25:19.328 CET] [debug] ||| finished t19_teaming_ab_nsna_ping wicked/aggregate/sut at 2019-11-21 14:25:19 (84 s)
[2019-11-21T15:25:19.330 CET] [debug] Loading a VM snapshot lastgood
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[0]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":600175,"seconds":1574346237}}
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[1]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":657272,"seconds":1574346237}}
[2019-11-21T15:25:19.331 CET] [debug] Loading snapshot (Current VM state is running).
[2019-11-21T15:25:19.337 CET] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":337351,"seconds":1574346319}}
[2019-11-21T15:25:19.338 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-11-21T15:25:21.364 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 40115 (/usr/bin/isotovideo: backen)
[2019-11-21T15:25:21.368 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/hd0-overlay1 /var/lib/openqa/pool/7/raid/hd0-overlay2 32212254720
[2019-11-21T15:25:21.390 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay2', fmt=qcow2 size=32212254720 backing_file=/var/lib/openqa/pool/7/raid/hd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.391 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/cd0-overlay1 /var/lib/openqa/pool/7/raid/cd0-overlay2 665845760
[2019-11-21T15:25:21.410 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay2', fmt=qcow2 size=665845760 backing_file=/var/lib/openqa/pool/7/raid/cd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.411 CET] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap6,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:af -netdev tap,id=qanet1,ifname=tap70,script=no,downscript=no -device virtio-net,netdev=qanet1,mac=52:54:00:12:43:af -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :97,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=virtio_console1,server,nowait,id=virtio_console1,logfile=virtio_console1.log,logappend=on -device virtconsole,chardev=virtio_console1,name=org.openqa.console.virtio_console1 -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/hd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay2,file=hd0-overlay2-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay2,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/cd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay2,file=cd0-overlay2-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay2,serial=cd0 -incoming defer
[2019-11-21T15:25:21.417 CET] [debug] Waiting for 0 attempts
[2019-11-21T15:25:22.418 CET] [debug] Waiting for 1 attempts
[2019-11-21T15:25:23.418 CET] [debug] Finished after 2 attempts
[2019-11-21T15:25:23.419 CET] [debug] qmpsocket 14
[2019-11-21T15:25:23.422 CET] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":422209,"seconds":1574346323}}
[2019-11-21T15:25:24.424 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":426431,"seconds":1574346323}}
[2019-11-21T15:25:26.668 CET] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":668448,"seconds":1574346326}}
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
  'connect_timeout' => 3,
  'hostname' => 'localhost',
  'port' => 5997
}
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console1')
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console')
[2019-11-21T15:25:26.724 CET] [debug] pointer type 1 0 1024 768 -257
[2019-11-21T15:25:26.724 CET] [debug] led state 0 1 1 -261
[2019-11-21T15:25:26.731 CET] [debug] Restored snapshot
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_CONNECTED","timestamp":{"microseconds":669821,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_INITIALIZED","timestamp":{"microseconds":671201,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":731641,"seconds":1574346326}}
[2019-11-21T15:25:26.777 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'sut' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown

so the problem seems to be related to forcing a rollback to a previous snapshot.

Reproducible

Fails sporadically in about 2/3 cases since multiple weeks at least

Expected result

20191120-1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMIResolved2018-03-05

History

#1 Updated by szarate almost 2 years ago

  • Related to action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#2 Updated by szarate almost 2 years ago

  • Subject changed from "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

I just want to discuss it further, and maybe prepare it for later. So adding to the U team's backlog for the time being.

#3 Updated by szarate almost 2 years ago

  • Subject changed from [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

#4 Updated by okurz almost 2 years ago

  • Subject changed from [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

that's actually "network"

#5 Updated by okurz almost 2 years ago

  • Priority changed from Normal to Urgent

actually "incompletes" should be handled as "Urgent" as also the label carry over would not work without a module mentioned

#6 Updated by asmorodskyi almost 2 years ago

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

#7 Updated by okurz almost 2 years ago

Maybe but whatever the test does it seems to be very specific so I suggest to avoid that :) There is a "feature request" hidden inside this ticket to make the problem more obvious, I agree :)

#8 Updated by okurz almost 2 years ago

  • Subject changed from [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

#9 Updated by okurz almost 2 years ago

  • Subject changed from [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

#10 Updated by okurz almost 2 years ago

asmorodskyi wrote:

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

https://openqa.suse.de/tests/3697797#next_previous shows that the issue also happens on x86_64 and that it is not always happening, like 9/10 times.

#11 Updated by okurz almost 2 years ago

Latest example: https://openqa.suse.de/tests/3772385/file/autoinst-log.txt

asmorodskyi I consider this still urgent, on [network], so for you, right?

#12 Updated by okurz almost 2 years ago

  • Subject changed from [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

#13 Updated by okurz almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

After asking asmorodskyi clarified to me that he does not want to look into this. I will see about excluding the test module causing the test module from the test schedule to prevent further incompletes which do not add more information.

Using an SQL query to find out where we use this label lately:

select jobs.id, jobs.test, result_dir from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id and comments.text ~ 'poo#60161';

so we have 15 and 15-SP1 qam jobs in the job groups 165 and 232. I added in the testsuite over https://openqa.suse.de/admin/test_suites

EXCLUDE_MODULES=t20_teaming_ab_all_link

Triggered with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789474: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789474
Created job #3789475: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789475

-> https://openqa.suse.de/tests/3789475#

EDIT: Interesting. All modules passed but job still incompletes with "sut is not responding". Trying to also exclude the now latest test module with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link,t19_teaming_ab_nsna_ping :

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789496: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789496
Created job #3789497: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789497

https://openqa.suse.de/t3789497

#14 Updated by okurz almost 2 years ago

  • Due date set to 2020-01-22
  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

#15 Updated by okurz over 1 year ago

  • Due date deleted (2020-01-22)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Urgent to Normal
  • Target version deleted (Current Sprint)

scenario is currently green so the incomplete is averted and another person can now pick up to fix the issue that the two excluded modules triggered.

#16 Updated by okurz over 1 year ago

seems that product validation tests incomplete now as well as QAM tests again: https://openqa.suse.de/tests/3820231

Seems someone added removed the testsuite setting so I added it back. The comment in the test suite description was still there as well.

#17 Updated by dzedro over 1 year ago

okurz when you do something, check if it does what you think it should do!
EXCLUDE_MODULES does not work on qam_wicked_aggregate_ref!

#18 Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent

As "EXCLUDE_MODULES" was deleted again from settings we are back with incompletes, putting prio back to Urgent

#19 Updated by dzedro over 1 year ago

QAM tests on sle15+ are passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping, appears to me that t19_teaming_ab_nsna_ping is problematic

#20 Updated by tjyrinki_suse over 1 year ago

  • Priority changed from Urgent to High
  • Start date deleted (2019-11-22)

Still passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping: https://openqa.suse.de/tests/4287132#next_previous

Maybe priority does not need to be Urgent if this workaround has brought an adequate level of testing, but should be still fixed to work fully.

#21 Updated by cfconrad about 1 year ago

  • Tags set to wicked

#22 Updated by cfconrad about 1 year ago

  • Assignee set to cfconrad

#23 Updated by cfconrad about 1 year ago

Current investigation:

  • the error appear also with a minimal set of test-modules like WICKED_EXCLUDE='^(?!t19_teaming_ab_nsna_ping|t18_teaming_ab_arp_ping)
  • If you set WICKED_TCPDUMP=0 will also avoid the problem
  • Using only one serial console with VIRTIO_CONSOLE_NUM=1, also doesn't show this error.
  • the problem is in serial_terminal::upload_file()! For some reason, if we upload a file from 2Mb we get such error. https://openqa.suse.de/tests/4794708
  • I'm not able to reproduce this error constantly in one of my openQA test environments (cfconrad-vm.qa.suse.de, autobot.qa.suse.de or openqa.wicked.suse.de)

#24 Updated by cfconrad 12 months ago

Did a manual test of using the second virtio_console and doing some output there. The log file is about ~190MB so there are some characters written. But I don't see the error.
https://openqa.suse.de/tests/4896792

Different things are:

  • single machine job
  • no snapshot loading during each test

#25 Updated by jlausuch 12 months ago

  • Project changed from openQA Tests to Wicked
  • Category deleted (Bugs in existing tests)

#26 Updated by okurz 11 months ago

  • Project changed from Wicked to openQA Tests
  • Category set to Bugs in existing tests

guys, please find a better solution than hiding tickets used on public open servers behind a private issue tracker. This is really annoying for outside contributors (and not really supportive of our company values even) and also it failed for the "auto-review" process in https://gitlab.suse.de/openqa/auto-review/pipelines to label the incomplete raising an alert for the team SUSE QE Tools.

#27 Updated by tjyrinki_suse 11 months ago

  • Subject changed from [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qem] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

#28 Updated by okurz about 2 months ago

  • Priority changed from High to Normal

This ticket was set to "High" priority but was not updated within 120 days which is 4 times the period of the SLO for "High" tickets (30 days) as described on https://progress.opensuse.org/projects/openqatests/wiki/Wiki#SLOs-service-level-objectives . The ticket will be set to the next lower priority of "Normal".

Also available in: Atom PDF