Project

General

Profile

Actions

action #70876

open

Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail

Added by mkittler over 3 years ago. Updated over 3 years ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2020-09-02
Due date:
% Done:

0%

Estimated time:

Description

Observation

Not sure why this job incompleted:

[2020-09-02T00:15:10.754 CEST] [debug] Current version is 4.6.1598623827.924c995d [interface v20]
[2020-09-02T00:15:10.840 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse: ab7782333c508e136760d4299a0c32297a5c38fc
[2020-09-02T00:15:11.276 CEST] [debug] usingenv DESKTOP=kde
…
[2020-09-02T00:15:11.278 CEST] [debug] usingenv NOIMAGES=1
[2020-09-02T00:15:11.294 CEST] [debug] scheduling isosize tests/installation/isosize.pm
…
[2020-09-02T00:15:11.520 CEST] [debug] scheduling shutdown tests/shutdown/shutdown.pm
[2020-09-02T00:15:11.577 CEST] [info] cmdsrv: daemon reachable under http://*:20033/FDK5NTQyiHRYvqGw/
[2020-09-02T00:15:11.579 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles: 73ebc3acc3c3cd24f09d9f6c15edc470eba6e5eb
[2020-09-02T00:15:11.580 CEST] [debug] init needles from /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
[2020-09-02T00:15:11.583 CEST] [info] Listening at "http://[::]:20033"
Web application available at http://[::]:20033
[2020-09-02T00:15:12.201 CEST] [debug] inst-console-20200224 contains inst-console twice
[2020-09-02T00:15:13.370 CEST] [debug] loaded 8184 needles
[2020-09-02T00:15:13.634 CEST] [debug] Blocking SIGTERM
[2020-09-02T00:15:13.721 CEST] [debug] Unblocking SIGTERM
5575: channel_out 15, channel_in 14
[2020-09-02T00:15:13.990 CEST] [debug] Blocking SIGTERM
[2020-09-02T00:15:14.122 CEST] [debug] Unblocking SIGTERM
5966: cmdpipe 13, rsppipe 16
[2020-09-02T00:15:14.125 CEST] [debug] started mgmt loop with pid 5966
remove_tree qemuscreenshot
[2020-09-02T00:15:14.382 CEST] [debug] qemu version detected: 3.1.1.1
[2020-09-02T00:15:14.709 CEST] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/3/raid
[2020-09-02T00:15:15.238 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/3/openSUSE-Jump-15.2-DVD-aarch64-Build72.2-Media.iso
[2020-09-02T00:15:15.443 CEST] [debug] Driver backend collected unknown process with pid 6596 and exit status: 0
[2020-09-02T00:15:15.477 CEST] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
[2020-09-02T00:15:15.485 CEST] [debug] stopping command server 5612 because test execution ended through exception
[2020-09-02T00:15:16.690 CEST] [debug] done with command server
[2020-09-02T00:15:16.690 CEST] [debug] stopping autotest process 5645
[2020-09-02T00:15:17.893 CEST] [debug] done with autotest process
5575: EXIT 1
[2020-09-02T00:15:18.0121 CEST] [info] Isotovideo exit status: 1
[2020-09-02T00:15:18.0153 CEST] [info] +++ worker notes +++
[2020-09-02T00:15:18.0154 CEST] [info] End time: 2020-09-01 22:15:18
[2020-09-02T00:15:18.0154 CEST] [info] Result: died
[2020-09-02T00:15:18.0238 CEST] [info] Uploading video.ogv
[2020-09-02T00:15:18.0319 CEST] [info] Uploading video_time.vtt
[2020-09-02T00:15:18.0355 CEST] [info] Uploading vars.json
[2020-09-02T00:15:18.0381 CEST] [info] Uploading autoinst-log.txt

The message isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.


job reference: https://openqa.opensuse.org/tests/1380901

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#70876

Expected result

Problem does not happen anymore or if it happens there is more helpful output.

Suggestions

  • Check current test code coverage in os-autoinst for the error handling of backend/driver.pm:128 with the line $self->_send_json({cmd => 'start_vm'}) || die "failed to start VM";
  • Ensure the code is properly covered with tests
  • Try to reproduce the above problem
  • Improve feedback in these cases

Workaround

Retrigger

Actions #1

Updated by okurz over 3 years ago

  • Subject changed from Job incompletes with auto_review:"failed to start VM" to Job incompletes with auto_review:"failed to start VM":retry
  • Category set to Regressions/Crashes
  • Target version set to Ready

careful with the auto_review regex. If it is too generic it can catch many jobs. But I think the issue can be worked around by retriggering. THe assigned worker shows no problems in more recent jobs in https://openqa.opensuse.org/admin/workers/131 and the next job in the same scenario https://openqa.opensuse.org/tests/1381761#live also looks fine so far

Actions #2

Updated by okurz over 3 years ago

  • Subject changed from Job incompletes with auto_review:"failed to start VM":retry to Job incompletes with "failed to start VM" and no further detail
  • Difficulty set to hard

The message isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.

that is really just a symptom. If you are interested in the current statistics of incompletes on o3 and osd you can find them now on https://gitlab.suse.de/openqa/auto-review/-/jobs/261093 and https://gitlab.suse.de/openqa/auto-review/-/jobs/261092 respectively. Most prominent is this ticket here. I think I was right in #70876#note-1 with "careful with the auto_review regex. If it is too generic it can catch many jobs." Because it did catch too many jobs that failed for a different reason, e.g. https://openqa.suse.de/tests/4716465 with

[2020-09-22T14:08:08.802 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65.

[2020-09-22T14:08:28.876 CEST] [debug] capture loop failed ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin chassis power off: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65.

[2020-09-22T14:08:28.877 CEST] [debug] sending magic and exit
[2020-09-22T14:08:28.877 CEST] [debug] received magic close
[2020-09-22T14:08:28.881 CEST] [debug] backend process exited: 0
[2020-09-22T14:08:28.882 CEST] [debug] Driver backend collected unknown process with pid 30254 and exit status: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.

others are still comparable to the original job with no "Backend process died" included, e.g. https://openqa.opensuse.org/tests/1400622
showing

[2020-09-21T20:07:47.198 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :99,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0
[2020-09-21T20:07:47.202 CEST] [debug] Waiting for 0 attempts
[2020-09-21T20:07:47.327 CEST] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.

as the original problem is really vague and does not seem to reproduce often and as I failed to come up with a stable regex that prevents the mismatch I remove the regex again, e.g. (?s)(?<!sending magic and exit).*failed to start VM' with a negative look-behind assertion with following variable length did not work.

Actions #3

Updated by okurz over 3 years ago

  • Description updated (diff)
  • Status changed from New to Workable

Added a DB query to find recent reproducing jobs on production and found

   id    |     t_finished      |        test        |                                         reason                                          
---------+---------------------+--------------------+-----------------------------------------------------------------------------------------
 1400622 | 2020-09-21 18:07:54 | extra_tests_on_kde | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
 1380901 | 2020-09-01 22:15:18 | kde                | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
(2 rows)

   id    |     t_finished      |                  test                   |                                         reason                                          
---------+---------------------+-----------------------------------------+-----------------------------------------------------------------------------------------
 4794996 | 2020-10-08 23:12:06 | ha_supportserver_upgraded_sle12sp3_ltss | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
 4791627 | 2020-10-08 01:45:00 | ltp_cap_bounds                          | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
(2 rows)

since 2020-09-01. Made the ticket workable with my additions.

Actions #4

Updated by okurz over 3 years ago

  • Target version changed from Ready to future
Actions #5

Updated by okurz over 3 years ago

  • Subject changed from Job incompletes with "failed to start VM" and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail
  • Description updated (diff)

auto-review failed today due to https://openqa.opensuse.org/tests/1485939 with:

[2020-11-27T12:34:14.858 CET] [debug] started mgmt loop with pid 28732
[2020-11-27T12:34:14.908 CET] [debug] qemu version detected: 4.2.1
[2020-11-27T12:34:14.910 CET] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/5/raid
[2020-11-27T12:34:14.925 CET] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/5/obs-server.x86_64-2.10.51-qcow2-Build11.130.qcow2
[2020-11-27T12:34:14.995 CET] [debug] Driver backend collected unknown process with pid 28830 and exit status: 0
[2020-11-27T12:34:14.999 CET] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.
[2020-11-27T12:34:14.999 CET] [debug] stopping command server 28666 because test execution ended through exception

the job in #70876#note-2 was a non-qemu one. Maybe we can use a regex that simply looks for qemu first?

Actions #6

Updated by okurz over 3 years ago

  • Description updated (diff)

looking on both o3 and osd I can currently find:

1485939|2020-11-27 11:34:15|done|incomplete|obs_appliance|isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.|openqaworker4
5128997|2020-12-04 07:37:59|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128990|2020-12-04 07:37:51|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128930|2020-12-04 06:20:40|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128923|2020-12-04 06:20:31|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128898|2020-12-04 05:21:04|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128889|2020-12-04 05:20:58|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128709|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128702|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128506|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128499|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1

interesting log: https://openqa.suse.de/tests/5128990

Actions #7

Updated by okurz over 3 years ago

  • Subject changed from Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail

The previous ones are invalid matches but likely we can add the additional message "Driver backend collected unknown process" in the middle to look for.

Actions #8

Updated by mkittler over 3 years ago

About job https://openqa.suse.de/tests/5128990 - it looks like a bug that isotovideo is even started considering the 404 error before the startup:

[2020-12-04T07:37:29.0143 UTC] [debug] [pid:84208] Found UEFI_PFLASH_VARS, caching sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2
[2020-12-04T07:37:29.0151 UTC] [info] [pid:84208] Downloading sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2, request #6176 sent to Cache Service
[2020-12-04T07:37:34.0266 UTC] [info] [pid:84208] Download of sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2 processed:
[info] [#6176] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#6176] Downloading "sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" from "http://openqa.suse.de/tests/5128990/asset/hdd/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2"
[info] [#6176] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" failed: 404 Not Found

[2020-12-04T07:37:34.0267 UTC] [error] [pid:84208] Failed to download sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2
Actions #9

Updated by mkittler over 3 years ago

Actually, we have special handling for UEFI_PFLASH_VARS assets. So this behavior is explicitly enabled by our code:

        if ($this_asset eq 'UEFI_PFLASH_VARS' && !defined $asset) {
            log_error("Failed to download $asset_uri", channels => 'autoinst');
            # assume that if we have a full path, that's what we should use
            $vars->{$this_asset} = $asset_uri if -e $asset_uri;
            # don't kill the job if the asset is not found
            # TODO: This seems to leave the job stuck in some cases (observed in production on openqaworker3).
            next;
        }

Maybe this code path should be harmonized with what we already have for the local asset detection:

next if (($key eq 'UEFI_PFLASH_VARS') and !$vars->{UEFI});

So we would already fail here on the worker level unless UEFI is really not set.

Actions

Also available in: Atom PDF