action #60161

[network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

Added by okurz 3 months ago. Updated 25 days ago.

Status:WorkableStart date:22/11/2019
Priority:UrgentDue date:
Assignee:-% Done:

0%

Category:Bugs in existing tests
Target version:-
Difficulty:
Duration:

Description

Observation

openQA test in scenario sle-15-Server-DVD-Updates-x86_64-qam_wicked_aggregate_sut@64bit incompletes after t19_teaming_ab_nsna_ping . From autoinst-log.txt:

[2019-11-21T15:25:19.328 CET] [debug] ||| finished t19_teaming_ab_nsna_ping wicked/aggregate/sut at 2019-11-21 14:25:19 (84 s)
[2019-11-21T15:25:19.330 CET] [debug] Loading a VM snapshot lastgood
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[0]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":600175,"seconds":1574346237}}
[2019-11-21T15:25:19.330 CET] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[1]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":657272,"seconds":1574346237}}
[2019-11-21T15:25:19.331 CET] [debug] Loading snapshot (Current VM state is running).
[2019-11-21T15:25:19.337 CET] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":337351,"seconds":1574346319}}
[2019-11-21T15:25:19.338 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-11-21T15:25:21.364 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 40115 (/usr/bin/isotovideo: backen)
[2019-11-21T15:25:21.368 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/hd0-overlay1 /var/lib/openqa/pool/7/raid/hd0-overlay2 32212254720
[2019-11-21T15:25:21.390 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay2', fmt=qcow2 size=32212254720 backing_file=/var/lib/openqa/pool/7/raid/hd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.391 CET] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/raid/cd0-overlay1 /var/lib/openqa/pool/7/raid/cd0-overlay2 665845760
[2019-11-21T15:25:21.410 CET] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay2', fmt=qcow2 size=665845760 backing_file=/var/lib/openqa/pool/7/raid/cd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-11-21T15:25:21.411 CET] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1024 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap6,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:af -netdev tap,id=qanet1,ifname=tap70,script=no,downscript=no -device virtio-net,netdev=qanet1,mac=52:54:00:12:43:af -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :97,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=virtio_console1,server,nowait,id=virtio_console1,logfile=virtio_console1.log,logappend=on -device virtconsole,chardev=virtio_console1,name=org.openqa.console.virtio_console1 -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/hd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay2,file=hd0-overlay2-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay2,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay2-file,filename=/var/lib/openqa/pool/7/raid/cd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay2,file=cd0-overlay2-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay2,serial=cd0 -incoming defer
[2019-11-21T15:25:21.417 CET] [debug] Waiting for 0 attempts
[2019-11-21T15:25:22.418 CET] [debug] Waiting for 1 attempts
[2019-11-21T15:25:23.418 CET] [debug] Finished after 2 attempts
[2019-11-21T15:25:23.419 CET] [debug] qmpsocket 14
[2019-11-21T15:25:23.422 CET] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":422209,"seconds":1574346323}}
[2019-11-21T15:25:24.424 CET] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":426431,"seconds":1574346323}}
[2019-11-21T15:25:26.668 CET] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":668448,"seconds":1574346326}}
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
  'connect_timeout' => 3,
  'hostname' => 'localhost',
  'port' => 5997
}
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console1')
[2019-11-21T15:25:26.671 CET] [debug] <<< consoles::virtio_terminal::open_socket(socket_path='/var/lib/openqa/pool/7/virtio_console')
[2019-11-21T15:25:26.724 CET] [debug] pointer type 1 0 1024 768 -257
[2019-11-21T15:25:26.724 CET] [debug] led state 0 1 1 -261
[2019-11-21T15:25:26.731 CET] [debug] Restored snapshot
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_CONNECTED","timestamp":{"microseconds":669821,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"data":{"client":{"family":"ipv4","host":"127.0.0.1","service":"50346","websocket":false},"server":{"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5997","websocket":false}},"event":"VNC_INITIALIZED","timestamp":{"microseconds":671201,"seconds":1574346326}}
[2019-11-21T15:25:26.731 CET] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":731641,"seconds":1574346326}}
[2019-11-21T15:25:26.777 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console 'sut' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown

so the problem seems to be related to forcing a rollback to a previous snapshot.

Reproducible

Fails sporadically in about 2/3 cases since multiple weeks at least

Expected result

20191120-1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #32746: [sle][tools][remote-backends][hard] Incomplete job becaus... Resolved 05/03/2018

History

#1 Updated by szarate 3 months ago

  • Related to action #32746: [sle][tools][remote-backends][hard] Incomplete job because console isn't responding correctly. Half-open socket on IPMI added

#2 Updated by szarate 3 months ago

  • Subject changed from "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

I just want to discuss it further, and maybe prepare it for later. So adding to the U team's backlog for the time being.

#3 Updated by szarate 3 months ago

  • Subject changed from [functional][u] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

#4 Updated by okurz 3 months ago

  • Subject changed from [functional][u][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

that's actually "network"

#5 Updated by okurz 3 months ago

  • Priority changed from Normal to Urgent

actually "incompletes" should be handled as "Urgent" as also the label carry over would not work without a module mentioned

#6 Updated by asmorodskyi 2 months ago

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

#7 Updated by okurz 2 months ago

Maybe but whatever the test does it seems to be very specific so I suggest to avoid that :) There is a "feature request" hidden inside this ticket to make the problem more obvious, I agree :)

#8 Updated by okurz 2 months ago

  • Subject changed from [network][qam] "The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link

#9 Updated by okurz 2 months ago

  • Subject changed from [network][qam] auto_review:"The console 'sut' is not responding (half-open socket?)." test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

#10 Updated by okurz 2 months ago

asmorodskyi wrote:

Failing at PowerKVM job https://openqa.suse.de/tests/3687312 . Exactly same code run fine on x86 and aarch64 . I would say that tags [network] and [qam] are misleading here . Issue not related to network or QAM . SOmething is broken in revert to snapshot mechanism

https://openqa.suse.de/tests/3697797#next_previous shows that the issue also happens on x86_64 and that it is not always happening, like 9/10 times.

#11 Updated by okurz about 1 month ago

Latest example: https://openqa.suse.de/tests/3772385/file/autoinst-log.txt

@asmorodskyi I consider this still urgent, on [network], so for you, right?

#12 Updated by okurz about 1 month ago

  • Subject changed from [network][qam] auto_review:"The console.*sut.*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link to [network][qam] auto_review:"The console.*(root-virtio-terminal1|sut).*is not responding.*half-open socket" test incompletes in t20_teaming_ab_all_link

#13 Updated by okurz about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to okurz

After asking @asmorodskyi clarified to me that he does not want to look into this. I will see about excluding the test module causing the test module from the test schedule to prevent further incompletes which do not add more information.

Using an SQL query to find out where we use this label lately:

select jobs.id, jobs.test, result_dir from jobs, comments where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and jobs.id in (select job_id from comments where id is not null)) and jobs.id = job_id and comments.text ~ 'poo#60161';

so we have 15 and 15-SP1 qam jobs in the job groups 165 and 232. I added in the testsuite over https://openqa.suse.de/admin/test_suites

EXCLUDE_MODULES=t20_teaming_ab_all_link

Triggered with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789474: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789474
Created job #3789475: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789475

-> https://openqa.suse.de/tests/3789475#

EDIT: Interesting. All modules passed but job still incompletes with "sut is not responding". Trying to also exclude the now latest test module with openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3789105 EXCLUDE_MODULES=t20_teaming_ab_all_link,t19_teaming_ab_nsna_ping :

Cloning dependencies of sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit
Created job #3789496: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_ref@64bit -> https://openqa.suse.de/t3789496
Created job #3789497: sle-15-SP1-Server-DVD-Updates-x86_64-Build20200115-2-qam_wicked_aggregate_sut@64bit -> https://openqa.suse.de/t3789497

https://openqa.suse.de/t3789497

#14 Updated by okurz about 1 month ago

  • Due date set to 22/01/2020
  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

#15 Updated by okurz about 1 month ago

  • Due date deleted (22/01/2020)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Urgent to Normal
  • Target version deleted (Current Sprint)

scenario is currently green so the incomplete is averted and another person can now pick up to fix the issue that the two excluded modules triggered.

#16 Updated by okurz about 1 month ago

seems that product validation tests incomplete now as well as QAM tests again: https://openqa.suse.de/tests/3820231

Seems someone added removed the testsuite setting so I added it back. The comment in the test suite description was still there as well.

#17 Updated by dzedro about 1 month ago

@okurz when you do something, check if it does what you think it should do!
EXCLUDE_MODULES does not work on qam_wicked_aggregate_ref!

#18 Updated by okurz 27 days ago

  • Priority changed from Normal to Urgent

As "EXCLUDE_MODULES" was deleted again from settings we are back with incompletes, putting prio back to Urgent

#19 Updated by dzedro 25 days ago

QAM tests on sle15+ are passing with WICKED_EXCLUDE=t19_teaming_ab_nsna_ping, appears to me that t19_teaming_ab_nsna_ping is problematic

Also available in: Atom PDF