Project

General

Profile

action #55505

lots of incompletes on o3, "qemu-system-x86_64: -vnc :91,share=force-shared: Failed to find an available port: Address already in use", maybe only openqaworker4:1?

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
2019-08-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.opensuse.org/tests/1006423/file/autoinst-log.txt :

[2019-08-14T09:47:55.169 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
can't open qmp at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 405.

[2019-08-14T09:47:55.169 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-08-14T09:47:55.170 CEST] [debug] flushing frames
last frame
[2019-08-14T09:47:55.225 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-08-14T09:47:55.227 CEST] [debug] QEMU: QEMU emulator version 3.1.0 (openSUSE Leap 15.1)
[2019-08-14T09:47:55.227 CEST] [debug] QEMU: Copyright (c) 2003-2018 Fabrice Bellard and the QEMU Project developers
[2019-08-14T09:47:55.227 CEST] [debug] QEMU: qemu-system-x86_64: -vnc :91,share=force-shared: Failed to find an available port: Address already in use
[2019-08-14T09:47:55.228 CEST] [debug] sending magic and exit

Problem

did all the incompletes happen on w4:1? somehow it seems a qemu process is left over and blocking the VNC port


Related issues

Related to openQA Project - action #55415: restart "recent incompletes" or "incompletes of today" easier then over web UINew2019-08-13

Related to openQA Tests - action #48671: [opensuse] save_memory_dump make isotovideo to failResolved2019-03-05

Has duplicate openQA Tests - action #55490: [sle][tools]can't open qmp at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 405Rejected2019-08-14

Copied to openQA Project - action #55526: detection for still running qemu instance in worker only works with "--no-cleanup"Resolved2019-08-14

History

#1 Updated by okurz over 2 years ago

did not find a left-over qemu process blocking port vnc :91 . Anyway, I restarted the service openqa-worker@1 and retriggered incompletes I could find. Monitoring https://openqa.opensuse.org/admin/workers/92

#2 Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback

Managed to restart all incompletes with

for i in $(ssh o3 "sudo -u geekotest psql --no-align --tuples-only --command=\"select id from jobs where (assigned_worker_id=(select id from workers where (host='openqaworker4' and instance='1') and result='incomplete' and t_finished >= '2019-08-14'));\" openqa"); do openqa-client --host openqa.opensuse.org jobs/$i/restart post; done

#3 Updated by okurz over 2 years ago

Also added the above code as script in https://github.com/os-autoinst/scripts/pull/5

#4 Updated by okurz over 2 years ago

  • Related to action #55415: restart "recent incompletes" or "incompletes of today" easier then over web UI added

#5 Updated by okurz over 2 years ago

[14/08/2019 13:04:39] <DimStar> okurz: there are also incompletes on ow1:11 https://openqa.opensuse.org/tests/1006682#downloads

see https://openqa.opensuse.org/admin/workers/49

Disabled the worker for investigation and restarted jobs:

env worker=openqaworker1 instance=11 ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance

The first incomplete of today seems to be https://openqa.opensuse.org/tests/1006973

which states:

[2019-08-14T12:07:20.783 CEST] [debug] Migrating the machine (Current VM state is running).
[2019-08-14T12:07:20.792 CEST] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":792701,"seconds":1565777240}}
[2019-08-14T12:07:20.793 CEST] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":793378,"seconds":1565777240}}
[2019-08-14T12:07:21.294 CEST] [debug] EVENT {"data":{"pass":1},"event":"MIGRATION_PASS","timestamp":{"microseconds":800778,"seconds":1565777240}}
[2019-08-14T12:07:21.294 CEST] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":800848,"seconds":1565777240}}
[2019-08-14T12:07:21.294 CEST] [debug] Migrating total bytes:       1615667200
[2019-08-14T12:07:21.294 CEST] [debug] Migrating remaining bytes:       811651072
[2019-08-14T12:07:21.795 CEST] [debug] EVENT {"data":{"pass":2},"event":"MIGRATION_PASS","timestamp":{"microseconds":671398,"seconds":1565777241}}
[2019-08-14T12:07:21.795 CEST] [debug] EVENT {"data":{"pass":3},"event":"MIGRATION_PASS","timestamp":{"microseconds":671665,"seconds":1565777241}}
[2019-08-14T12:07:21.864 CEST] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":864223,"seconds":1565777241}}
[2019-08-14T12:07:21.865 CEST] [debug] Migrating total bytes:       1615667200
[2019-08-14T12:07:21.865 CEST] [debug] Migrating remaining bytes:       0
[2019-08-14T12:07:21.865 CEST] [debug] Memory dump completed.
[2019-08-14T12:07:23.034 CEST] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":33935,"seconds":1565777243}}
ulogs/first_boot-vm-memory-dump: 237.6 MiB / 996.0 MiB = 0.239, 2.3 MiB/s, 7:12
[2019-08-14T12:14:35.791 CEST] [debug] backend process exited: 0
[2019-08-14T12:14:35.792 CEST] [debug] sysread failed: 
[2019-08-14T12:14:35.792 CEST] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-08-14T12:14:35.793 CEST] [debug] killing command server 19153 because test execution ended
[2019-08-14T12:14:35.793 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20113/wMJ27e8Oec4yok7d/broadcast
last frame
[2019-08-14T12:14:35.814 CEST] [debug] Driver backend collected unknown process with pid 19203 and exit status: 0
[2019-08-14T12:14:50.809 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 175.

[2019-08-14T12:14:51.812 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.

[2019-08-14T12:14:51.819 CEST] [debug] commands process exited: 0
[2019-08-14T12:14:52.819 CEST] [debug] done with command server
[2019-08-14T12:14:52.819 CEST] [debug] killing autotest process 19162
[2019-08-14T12:14:52.820 CEST] [debug] sysread failed: 
[2019-08-14T12:14:52.820 CEST] [debug] /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/tests/installation/first_boot.pm:107 called bootbasetest::post_fail_hook
[2019-08-14T12:14:52.820 CEST] [debug] <<< testapi::select_console(testapi_console='root-console')
[2019-08-14T12:14:52.821 CEST] [debug] post_fail_hook failed: syswrite failed Broken pipe at /usr/lib/os-autoinst/myjsonrpc.pm line 40.

in the former case of openqaworker4:11 it was https://openqa.opensuse.org/tests/1006320/file/autoinst-log.txt with what looks like the same error, also in "first_boot", just in or after savevm.

and now it's openqaworker4:16

I can see that there is a qemu process left running:

_openqa+ 32632 47.8  0.6 6499756 1731208 ?     Sl   12:40  20:09  \_ /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :106,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/16/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/16/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0

Stopping the systemd service stops the process though it does not really belong to the most recent job.

The offending job seems to be https://openqa.opensuse.org/tests/1006627/file/autoinst-log.txt where the timestamps coincide with the start time of the above qemu process so 12:40. The log content looks similar although it's not the "first_boot" module but "boot_to_desktop":

[2019-08-14T13:04:37.725 CEST] [debug] Migrating the machine (Current VM state is running).
[2019-08-14T13:04:37.734 CEST] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":734786,"seconds":1565780677}}
[2019-08-14T13:04:37.735 CEST] [debug] EVENT {"data":{"status":"setup"},"event":"MIGRATION","timestamp":{"microseconds":735364,"seconds":1565780677}}
[2019-08-14T13:04:38.236 CEST] [debug] EVENT {"data":{"pass":1},"event":"MIGRATION_PASS","timestamp":{"microseconds":742619,"seconds":1565780677}}
[2019-08-14T13:04:38.236 CEST] [debug] EVENT {"data":{"status":"active"},"event":"MIGRATION","timestamp":{"microseconds":742688,"seconds":1565780677}}
[2019-08-14T13:04:38.236 CEST] [debug] Migrating total bytes:       1615667200
[2019-08-14T13:04:38.236 CEST] [debug] Migrating remaining bytes:       465211392
[2019-08-14T13:04:38.736 CEST] [debug] EVENT {"data":{"pass":2},"event":"MIGRATION_PASS","timestamp":{"microseconds":356784,"seconds":1565780678}}
[2019-08-14T13:04:38.737 CEST] [debug] EVENT {"data":{"pass":3},"event":"MIGRATION_PASS","timestamp":{"microseconds":357118,"seconds":1565780678}}
[2019-08-14T13:04:38.737 CEST] [debug] EVENT {"data":{"status":"completed"},"event":"MIGRATION","timestamp":{"microseconds":550282,"seconds":1565780678}}
[2019-08-14T13:04:38.737 CEST] [debug] Migrating total bytes:       1615667200
[2019-08-14T13:04:38.737 CEST] [debug] Migrating remaining bytes:       0
[2019-08-14T13:04:38.737 CEST] [debug] Memory dump completed.
[2019-08-14T13:04:40.744 CEST] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":744225,"seconds":1565780680}}
ulogs/boot_to_desktop-vm-memory-dump: 137.2 MiB / 695.2 MiB = 0.197, 2.2 MiB/s, 5:11
[2019-08-14T13:09:51.972 CEST] [debug] backend process exited: 0
[2019-08-14T13:09:51.972 CEST] [debug] sysread failed: 
[2019-08-14T13:09:51.973 CEST] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-08-14T13:09:51.973 CEST] [debug] killing command server 32568 because test execution ended
[2019-08-14T13:09:51.973 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20163/5XdlseWl0U7TWQDN/broadcast
last frame
[2019-08-14T13:09:52.012 CEST] [debug] Driver backend collected unknown process with pid 32603 and exit status: 0
[2019-08-14T13:10:06.991 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 175.

[2019-08-14T13:10:07.993 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.

[2019-08-14T13:10:08.000 CEST] [debug] commands process exited: 0
[2019-08-14T13:10:09.000 CEST] [debug] done with command server
[2019-08-14T13:10:09.000 CEST] [debug] killing autotest process 32570
[2019-08-14T13:10:09.000 CEST] [debug] sysread failed: 
[2019-08-14T13:10:09.001 CEST] [debug] <<< testapi::select_console(testapi_console='root-console')

Let's look into bisecting:

openqaworker4:~ # grep 'os-autoinst\>' /var/log/zypp/history  | grep 2019-08-1
2019-08-10 00:02:28|install|os-autoinst|4.5.1565345228.566099f8-lp151.158.1|x86_64||devel_openQA|4d85dfc37280093eb4331ab54b7c63b114cc7d58894c91f261adbf5bcf7c3370|
2019-08-13 00:46:10|install|os-autoinst|4.5.1565639840.222a8dd4-lp151.159.1|x86_64||devel_openQA|31b1dba3739e3b40ceabcfd6d61bfd3204cacbf886172f2c7c358bd2595724f2|

this resolves to a single commit:

$ git log1 --no-merges 566099f8..222a8dd4
f9c71d0e Fix missing data while reading from virtio_console

For tests I take the git hash of the job from the ticket description as "first bad", "last good" could be e.g. one job in before in the same scenario: https://openqa.opensuse.org/tests/1006324 even though the problem might not appear in all cases so we should need to select an older job instead as reference.

Let's look for something obvious in the smaller set first:

git log1 --no-merges 561cdcf43bc26fb09bcec99623ac8cc51e1ae321..e6dbdbaea80802a7c6c4a107341f07bbab47e4f9
870e56bb8 wicked: Improve barriers handling
3674a01d9 LTP: Fix tag name for verification builds via github PR
c946ecc07 LTP: Don't call use_ssh_serial_console() for IPMI
a566493d1 select_serial_terminal: Document variables in perlpod
ef3444aaf system_prepare: remove unused import
ae5c7b854 Remove create_autoyast test from MicroOS
77e43e8e8 Add HANA_HDBLCM variable for the hdblcm path
5643ef704 wicked: Fix macvtap failure
23bae5e29 Skip systemd-logind restart when using saptune
e9df55613 Fix hana cli installation for saptune v2
8f3086649 Check registered addons before&after migration

which does not show anything obvious.

#6 Updated by okurz over 2 years ago

  • Related to action #48671: [opensuse] save_memory_dump make isotovideo to fail added

#7 Updated by okurz over 2 years ago

  • Status changed from Feedback to In Progress

Consulted with cfconrad and he also suspects the os-autoinst check. The problem has appeared multiple times with same symptoms, I have restarted the worker instance systemd job along with all incomplete jobs. If it happens again I will revert openqaworker1 and openqaworker4 to the previous snapshot and revert the change in os-autoinst to be sure.

Seems I could reproduce the problem on lord.arch: http://lord.arch.suse.de/tests/2513 causes the following error in the worker log:

[debug] [pid:5239] Job 2513 from http://localhost finished - reason: died
[error] [pid:5239] A QEMU instance using the current pool directory is still running (PID: 6349)
[error] [pid:5239] A QEMU instance using the current pool directory is still running (PID: 6349)

Manually killed the qemu instance and retriggered a job.

I wonder why we do not see the same error message on production workers but I suspect that the check can only work if the pool directory is not pruned which is certainly a bug. I simulated that behaviour with deleting the content from the pool directory. which could reproduce the exact problem: http://lord.arch.suse.de/tests/2522/file/autoinst-log.txt

Meanwhile happened again as a consequence of https://openqa.opensuse.org/tests/1007331#

so repaired with:

export worker=openqaworker4; export instance=1; sshpass -p $pw ssh -J o3 root@$worker "systemctl restart openqa-worker@$instance" && sh -ex ~/local/os-autoinst/scripts/openqa-restart-incompletes-on-worker-instance

On lord.arch reverted said os-autoinst commit and retriggered.

Also created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8202 to prevent save_memory_dump causing more incompletes.

https://github.com/os-autoinst/os-autoinst/pull/1198 created and merged. Now I could rollback the workers with e.g. for i in imagetester openqaworker1 openqaworker4; do echo $i && ssh root@$i "transactional-update rollback last && reboot"; done but could be less disruptive if I install the new package as soon as it is available and only reboot after the test load has reduced due to the fact that I have a workaround available? Anyway, to get it over with I just did the rollback and reboot.

#8 Updated by okurz over 2 years ago

  • Copied to action #55526: detection for still running qemu instance in worker only works with "--no-cleanup" added

#9 Updated by okurz over 2 years ago

  • Status changed from In Progress to Feedback

no more related incompletes on https://openqa.opensuse.org/tests?&resultfilter=Incomplete , I can look again tomorrow.

#10 Updated by okurz over 2 years ago

  • Has duplicate action #55490: [sle][tools]can't open qmp at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 405 added

#11 Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

no more related problems. I leave the rest to cfconrad who certainly wants to bring in his changes again in a better way :)

#12 Updated by cfconrad over 2 years ago

problem with save_memory_dump(), should be fixed with https://github.com/os-autoinst/os-autoinst/pull/1201

Also available in: Atom PDF