action #70876
Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail
0%
Description
Observation¶
Not sure why this job incompleted:
[37m[2020-09-02T00:15:10.754 CEST] [debug] Current version is 4.6.1598623827.924c995d [interface v20] [0m[37m[2020-09-02T00:15:10.840 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse: ab7782333c508e136760d4299a0c32297a5c38fc [0m[37m[2020-09-02T00:15:11.276 CEST] [debug] usingenv DESKTOP=kde … [0m[37m[2020-09-02T00:15:11.278 CEST] [debug] usingenv NOIMAGES=1 [0m[37m[2020-09-02T00:15:11.294 CEST] [debug] scheduling isosize tests/installation/isosize.pm … [0m[37m[2020-09-02T00:15:11.520 CEST] [debug] scheduling shutdown tests/shutdown/shutdown.pm [0m[2020-09-02T00:15:11.577 CEST] [info] cmdsrv: daemon reachable under http://*:20033/FDK5NTQyiHRYvqGw/ [37m[2020-09-02T00:15:11.579 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles: 73ebc3acc3c3cd24f09d9f6c15edc470eba6e5eb [0m[37m[2020-09-02T00:15:11.580 CEST] [debug] init needles from /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles [0m[2020-09-02T00:15:11.583 CEST] [info] Listening at "http://[::]:20033" Web application available at http://[::]:20033 [37m[2020-09-02T00:15:12.201 CEST] [debug] inst-console-20200224 contains inst-console twice [0m[37m[2020-09-02T00:15:13.370 CEST] [debug] loaded 8184 needles [0m[37m[2020-09-02T00:15:13.634 CEST] [debug] Blocking SIGTERM [0m[37m[2020-09-02T00:15:13.721 CEST] [debug] Unblocking SIGTERM [0m5575: channel_out 15, channel_in 14 [37m[2020-09-02T00:15:13.990 CEST] [debug] Blocking SIGTERM [0m[37m[2020-09-02T00:15:14.122 CEST] [debug] Unblocking SIGTERM [0m5966: cmdpipe 13, rsppipe 16 [37m[2020-09-02T00:15:14.125 CEST] [debug] started mgmt loop with pid 5966 [0mremove_tree qemuscreenshot [37m[2020-09-02T00:15:14.382 CEST] [debug] qemu version detected: 3.1.1.1 [0m[37m[2020-09-02T00:15:14.709 CEST] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/3/raid [0m[37m[2020-09-02T00:15:15.238 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/3/openSUSE-Jump-15.2-DVD-aarch64-Build72.2-Media.iso [0m[37m[2020-09-02T00:15:15.443 CEST] [debug] Driver backend collected unknown process with pid 6596 and exit status: 0 [0m[37m[2020-09-02T00:15:15.477 CEST] [debug] backend process exited: 0 [0mfailed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. [37m[2020-09-02T00:15:15.485 CEST] [debug] stopping command server 5612 because test execution ended through exception [0m[37m[2020-09-02T00:15:16.690 CEST] [debug] done with command server [0m[37m[2020-09-02T00:15:16.690 CEST] [debug] stopping autotest process 5645 [0m[37m[2020-09-02T00:15:17.893 CEST] [debug] done with autotest process [0m5575: EXIT 1 [2020-09-02T00:15:18.0121 CEST] [info] Isotovideo exit status: 1 [2020-09-02T00:15:18.0153 CEST] [info] +++ worker notes +++ [2020-09-02T00:15:18.0154 CEST] [info] End time: 2020-09-01 22:15:18 [2020-09-02T00:15:18.0154 CEST] [info] Result: died [2020-09-02T00:15:18.0238 CEST] [info] Uploading video.ogv [2020-09-02T00:15:18.0319 CEST] [info] Uploading video_time.vtt [2020-09-02T00:15:18.0355 CEST] [info] Uploading vars.json [2020-09-02T00:15:18.0381 CEST] [info] Uploading autoinst-log.txt
The message isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.
job reference: https://openqa.opensuse.org/tests/1380901
Steps to reproduce¶
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#70876
Expected result¶
Problem does not happen anymore or if it happens there is more helpful output.
Suggestions¶
- Check current test code coverage in os-autoinst for the error handling of backend/driver.pm:128 with the line
$self->_send_json({cmd => 'start_vm'}) || die "failed to start VM";
- Ensure the code is properly covered with tests
- Try to reproduce the above problem
- Improve feedback in these cases
Workaround¶
Retrigger
History
#1
Updated by okurz 8 months ago
- Subject changed from Job incompletes with auto_review:"failed to start VM" to Job incompletes with auto_review:"failed to start VM":retry
- Category set to Concrete Bugs
- Target version set to Ready
careful with the auto_review regex. If it is too generic it can catch many jobs. But I think the issue can be worked around by retriggering. THe assigned worker shows no problems in more recent jobs in https://openqa.opensuse.org/admin/workers/131 and the next job in the same scenario https://openqa.opensuse.org/tests/1381761#live also looks fine so far
#2
Updated by okurz 7 months ago
- Subject changed from Job incompletes with auto_review:"failed to start VM":retry to Job incompletes with "failed to start VM" and no further detail
- Difficulty set to hard
The message
isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.
that is really just a symptom. If you are interested in the current statistics of incompletes on o3 and osd you can find them now on https://gitlab.suse.de/openqa/auto-review/-/jobs/261093 and https://gitlab.suse.de/openqa/auto-review/-/jobs/261092 respectively. Most prominent is this ticket here. I think I was right in #70876#note-1 with "careful with the auto_review regex. If it is too generic it can catch many jobs." Because it did catch too many jobs that failed for a different reason, e.g. https://openqa.suse.de/tests/4716465 with
[37m[2020-09-22T14:08:08.802 CEST] [debug] Backend process died, backend errors are reported below in the following lines: ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65. [0m[37m[2020-09-22T14:08:28.876 CEST] [debug] capture loop failed ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin chassis power off: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65. [0m[37m[2020-09-22T14:08:28.877 CEST] [debug] sending magic and exit [0m[37m[2020-09-22T14:08:28.877 CEST] [debug] received magic close [0m[37m[2020-09-22T14:08:28.881 CEST] [debug] backend process exited: 0 [0m[37m[2020-09-22T14:08:28.882 CEST] [debug] Driver backend collected unknown process with pid 30254 and exit status: 0 [0mfailed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
others are still comparable to the original job with no "Backend process died" included, e.g. https://openqa.opensuse.org/tests/1400622
showing
[0m[37m[2020-09-21T20:07:47.198 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :99,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 [0m[37m[2020-09-21T20:07:47.202 CEST] [debug] Waiting for 0 attempts [0m[37m[2020-09-21T20:07:47.327 CEST] [debug] backend process exited: 0 [0mfailed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
as the original problem is really vague and does not seem to reproduce often and as I failed to come up with a stable regex that prevents the mismatch I remove the regex again, e.g. (?s)(?<!sending magic and exit).*failed to start VM'
with a negative look-behind assertion with following variable length did not work.
#3
Updated by okurz 6 months ago
- Description updated (diff)
- Status changed from New to Workable
Added a DB query to find recent reproducing jobs on production and found
id | t_finished | test | reason ---------+---------------------+--------------------+----------------------------------------------------------------------------------------- 1400622 | 2020-09-21 18:07:54 | extra_tests_on_kde | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. 1380901 | 2020-09-01 22:15:18 | kde | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. (2 rows) id | t_finished | test | reason ---------+---------------------+-----------------------------------------+----------------------------------------------------------------------------------------- 4794996 | 2020-10-08 23:12:06 | ha_supportserver_upgraded_sle12sp3_ltss | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. 4791627 | 2020-10-08 01:45:00 | ltp_cap_bounds | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. (2 rows)
since 2020-09-01. Made the ticket workable with my additions.
#5
Updated by okurz 5 months ago
- Subject changed from Job incompletes with "failed to start VM" and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail
- Description updated (diff)
auto-review failed today due to https://openqa.opensuse.org/tests/1485939 with:
[37m[2020-11-27T12:34:14.858 CET] [debug] started mgmt loop with pid 28732 [0m[37m[2020-11-27T12:34:14.908 CET] [debug] qemu version detected: 4.2.1 [0m[37m[2020-11-27T12:34:14.910 CET] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/5/raid [0m[37m[2020-11-27T12:34:14.925 CET] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/5/obs-server.x86_64-2.10.51-qcow2-Build11.130.qcow2 [0m[37m[2020-11-27T12:34:14.995 CET] [debug] Driver backend collected unknown process with pid 28830 and exit status: 0 [0m[37m[2020-11-27T12:34:14.999 CET] [debug] backend process exited: 0 [0mfailed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126. [37m[2020-11-27T12:34:14.999 CET] [debug] stopping command server 28666 because test execution ended through exception
the job in #70876#note-2 was a non-qemu one. Maybe we can use a regex that simply looks for qemu first?
#6
Updated by okurz 4 months ago
- Description updated (diff)
looking on both o3 and osd I can currently find:
1485939|2020-11-27 11:34:15|done|incomplete|obs_appliance|isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.|openqaworker4 5128997|2020-12-04 07:37:59|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1 5128990|2020-12-04 07:37:51|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2 5128930|2020-12-04 06:20:40|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1 5128923|2020-12-04 06:20:31|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2 5128898|2020-12-04 05:21:04|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1 5128889|2020-12-04 05:20:58|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2 5128709|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1 5128702|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1 5128506|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2 5128499|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
interesting log: https://openqa.suse.de/tests/5128990
#7
Updated by okurz 4 months ago
- Subject changed from Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail
The previous ones are invalid matches but likely we can add the additional message "Driver backend collected unknown process" in the middle to look for.
#8
Updated by mkittler 4 months ago
About job https://openqa.suse.de/tests/5128990 - it looks like a bug that isotovideo is even started considering the 404 error before the startup:
[2020-12-04T07:37:29.0143 UTC] [debug] [pid:84208] Found UEFI_PFLASH_VARS, caching sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2 [2020-12-04T07:37:29.0151 UTC] [info] [pid:84208] Downloading sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2, request #6176 sent to Cache Service [2020-12-04T07:37:34.0266 UTC] [info] [pid:84208] Download of sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2 processed: [info] [#6176] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB [info] [#6176] Downloading "sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" from "http://openqa.suse.de/tests/5128990/asset/hdd/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" [info] [#6176] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" failed: 404 Not Found [2020-12-04T07:37:34.0267 UTC] [error] [pid:84208] Failed to download sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2
#9
Updated by mkittler 4 months ago
Actually, we have special handling for UEFI_PFLASH_VARS
assets. So this behavior is explicitly enabled by our code:
if ($this_asset eq 'UEFI_PFLASH_VARS' && !defined $asset) { log_error("Failed to download $asset_uri", channels => 'autoinst'); # assume that if we have a full path, that's what we should use $vars->{$this_asset} = $asset_uri if -e $asset_uri; # don't kill the job if the asset is not found # TODO: This seems to leave the job stuck in some cases (observed in production on openqaworker3). next; }
Maybe this code path should be harmonized with what we already have for the local asset detection:
next if (($key eq 'UEFI_PFLASH_VARS') and !$vars->{UEFI});
So we would already fail here on the worker level unless UEFI
is really not set.