Project

General

Profile

action #70876

Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail

Added by mkittler 8 months ago. Updated 4 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2020-09-02
Due date:
% Done:

0%

Estimated time:
Difficulty:
hard

Description

Observation

Not sure why this job incompleted:

[2020-09-02T00:15:10.754 CEST] [debug] Current version is 4.6.1598623827.924c995d [interface v20]
[2020-09-02T00:15:10.840 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse: ab7782333c508e136760d4299a0c32297a5c38fc
[2020-09-02T00:15:11.276 CEST] [debug] usingenv DESKTOP=kde
…
[2020-09-02T00:15:11.278 CEST] [debug] usingenv NOIMAGES=1
[2020-09-02T00:15:11.294 CEST] [debug] scheduling isosize tests/installation/isosize.pm
…
[2020-09-02T00:15:11.520 CEST] [debug] scheduling shutdown tests/shutdown/shutdown.pm
[2020-09-02T00:15:11.577 CEST] [info] cmdsrv: daemon reachable under http://*:20033/FDK5NTQyiHRYvqGw/
[2020-09-02T00:15:11.579 CEST] [debug] git hash in /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles: 73ebc3acc3c3cd24f09d9f6c15edc470eba6e5eb
[2020-09-02T00:15:11.580 CEST] [debug] init needles from /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/products/opensuse/needles
[2020-09-02T00:15:11.583 CEST] [info] Listening at "http://[::]:20033"
Web application available at http://[::]:20033
[2020-09-02T00:15:12.201 CEST] [debug] inst-console-20200224 contains inst-console twice
[2020-09-02T00:15:13.370 CEST] [debug] loaded 8184 needles
[2020-09-02T00:15:13.634 CEST] [debug] Blocking SIGTERM
[2020-09-02T00:15:13.721 CEST] [debug] Unblocking SIGTERM
5575: channel_out 15, channel_in 14
[2020-09-02T00:15:13.990 CEST] [debug] Blocking SIGTERM
[2020-09-02T00:15:14.122 CEST] [debug] Unblocking SIGTERM
5966: cmdpipe 13, rsppipe 16
[2020-09-02T00:15:14.125 CEST] [debug] started mgmt loop with pid 5966
remove_tree qemuscreenshot
[2020-09-02T00:15:14.382 CEST] [debug] qemu version detected: 3.1.1.1
[2020-09-02T00:15:14.709 CEST] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/3/raid
[2020-09-02T00:15:15.238 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/3/openSUSE-Jump-15.2-DVD-aarch64-Build72.2-Media.iso
[2020-09-02T00:15:15.443 CEST] [debug] Driver backend collected unknown process with pid 6596 and exit status: 0
[2020-09-02T00:15:15.477 CEST] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
[2020-09-02T00:15:15.485 CEST] [debug] stopping command server 5612 because test execution ended through exception
[2020-09-02T00:15:16.690 CEST] [debug] done with command server
[2020-09-02T00:15:16.690 CEST] [debug] stopping autotest process 5645
[2020-09-02T00:15:17.893 CEST] [debug] done with autotest process
5575: EXIT 1
[2020-09-02T00:15:18.0121 CEST] [info] Isotovideo exit status: 1
[2020-09-02T00:15:18.0153 CEST] [info] +++ worker notes +++
[2020-09-02T00:15:18.0154 CEST] [info] End time: 2020-09-01 22:15:18
[2020-09-02T00:15:18.0154 CEST] [info] Result: died
[2020-09-02T00:15:18.0238 CEST] [info] Uploading video.ogv
[2020-09-02T00:15:18.0319 CEST] [info] Uploading video_time.vtt
[2020-09-02T00:15:18.0355 CEST] [info] Uploading vars.json
[2020-09-02T00:15:18.0381 CEST] [info] Uploading autoinst-log.txt

The message isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.


job reference: https://openqa.opensuse.org/tests/1380901

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#70876

Expected result

Problem does not happen anymore or if it happens there is more helpful output.

Suggestions

  • Check current test code coverage in os-autoinst for the error handling of backend/driver.pm:128 with the line $self->_send_json({cmd => 'start_vm'}) || die "failed to start VM";
  • Ensure the code is properly covered with tests
  • Try to reproduce the above problem
  • Improve feedback in these cases

Workaround

Retrigger

History

#1 Updated by okurz 8 months ago

  • Subject changed from Job incompletes with auto_review:"failed to start VM" to Job incompletes with auto_review:"failed to start VM":retry
  • Category set to Concrete Bugs
  • Target version set to Ready

careful with the auto_review regex. If it is too generic it can catch many jobs. But I think the issue can be worked around by retriggering. THe assigned worker shows no problems in more recent jobs in https://openqa.opensuse.org/admin/workers/131 and the next job in the same scenario https://openqa.opensuse.org/tests/1381761#live also looks fine so far

#2 Updated by okurz 7 months ago

  • Subject changed from Job incompletes with auto_review:"failed to start VM":retry to Job incompletes with "failed to start VM" and no further detail
  • Difficulty set to hard

The message isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128. is saved as reason but I'm not sure whether it is really the source of the problem or just a symptom.

that is really just a symptom. If you are interested in the current statistics of incompletes on o3 and osd you can find them now on https://gitlab.suse.de/openqa/auto-review/-/jobs/261093 and https://gitlab.suse.de/openqa/auto-review/-/jobs/261092 respectively. Most prominent is this ticket here. I think I was right in #70876#note-1 with "careful with the auto_review regex. If it is too generic it can catch many jobs." Because it did catch too many jobs that failed for a different reason, e.g. https://openqa.suse.de/tests/4716465 with

[2020-09-22T14:08:08.802 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin mc guid: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65.

[2020-09-22T14:08:28.876 CEST] [debug] capture loop failed ipmitool -I lanplus -H thunderx21-sp.qa.suse.de -U admin -P admin chassis power off: Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 65.

[2020-09-22T14:08:28.877 CEST] [debug] sending magic and exit
[2020-09-22T14:08:28.877 CEST] [debug] received magic close
[2020-09-22T14:08:28.881 CEST] [debug] backend process exited: 0
[2020-09-22T14:08:28.882 CEST] [debug] Driver backend collected unknown process with pid 30254 and exit status: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.

others are still comparable to the original job with no "Backend process died" included, e.g. https://openqa.opensuse.org/tests/1400622
showing

[2020-09-21T20:07:47.198 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :99,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/9/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0
[2020-09-21T20:07:47.202 CEST] [debug] Waiting for 0 attempts
[2020-09-21T20:07:47.327 CEST] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.

as the original problem is really vague and does not seem to reproduce often and as I failed to come up with a stable regex that prevents the mismatch I remove the regex again, e.g. (?s)(?<!sending magic and exit).*failed to start VM' with a negative look-behind assertion with following variable length did not work.

#3 Updated by okurz 6 months ago

  • Description updated (diff)
  • Status changed from New to Workable

Added a DB query to find recent reproducing jobs on production and found

   id    |     t_finished      |        test        |                                         reason                                          
---------+---------------------+--------------------+-----------------------------------------------------------------------------------------
 1400622 | 2020-09-21 18:07:54 | extra_tests_on_kde | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
 1380901 | 2020-09-01 22:15:18 | kde                | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
(2 rows)

   id    |     t_finished      |                  test                   |                                         reason                                          
---------+---------------------+-----------------------------------------+-----------------------------------------------------------------------------------------
 4794996 | 2020-10-08 23:12:06 | ha_supportserver_upgraded_sle12sp3_ltss | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
 4791627 | 2020-10-08 01:45:00 | ltp_cap_bounds                          | isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 128.
(2 rows)

since 2020-09-01. Made the ticket workable with my additions.

#4 Updated by okurz 6 months ago

  • Target version changed from Ready to future

#5 Updated by okurz 5 months ago

  • Subject changed from Job incompletes with "failed to start VM" and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail
  • Description updated (diff)

auto-review failed today due to https://openqa.opensuse.org/tests/1485939 with:

[2020-11-27T12:34:14.858 CET] [debug] started mgmt loop with pid 28732
[2020-11-27T12:34:14.908 CET] [debug] qemu version detected: 4.2.1
[2020-11-27T12:34:14.910 CET] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/5/raid
[2020-11-27T12:34:14.925 CET] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/5/obs-server.x86_64-2.10.51-qcow2-Build11.130.qcow2
[2020-11-27T12:34:14.995 CET] [debug] Driver backend collected unknown process with pid 28830 and exit status: 0
[2020-11-27T12:34:14.999 CET] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.
[2020-11-27T12:34:14.999 CET] [debug] stopping command server 28666 because test execution ended through exception

the job in #70876#note-2 was a non-qemu one. Maybe we can use a regex that simply looks for qemu first?

#6 Updated by okurz 4 months ago

  • Description updated (diff)

looking on both o3 and osd I can currently find:

1485939|2020-11-27 11:34:15|done|incomplete|obs_appliance|isotovideo died: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.|openqaworker4
5128997|2020-12-04 07:37:59|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128990|2020-12-04 07:37:51|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128930|2020-12-04 06:20:40|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128923|2020-12-04 06:20:31|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128898|2020-12-04 05:21:04|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128889|2020-12-04 05:20:58|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128709|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128702|2020-12-04 04:20:46|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1
5128506|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-92.1@aarch64-virtio-minimal_with_sdk92.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-2
5128499|2020-12-04 03:20:41|done|incomplete|install_ltp+sle+Online+KOTD|backend died: qemu-img: Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': Could not open 'sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2': No such file or directory|openqaworker-arm-1

interesting log: https://openqa.suse.de/tests/5128990

#7 Updated by okurz 4 months ago

  • Subject changed from Job incompletes with auto_review:"(?s)qemu version detected.*failed to start VM":retry and no further detail to Job incompletes with auto_review:"(?s)qemu version detected.*Driver backend collected unknown process.*failed to start VM":retry and no further detail

The previous ones are invalid matches but likely we can add the additional message "Driver backend collected unknown process" in the middle to look for.

#8 Updated by mkittler 4 months ago

About job https://openqa.suse.de/tests/5128990 - it looks like a bug that isotovideo is even started considering the 404 error before the startup:

[2020-12-04T07:37:29.0143 UTC] [debug] [pid:84208] Found UEFI_PFLASH_VARS, caching sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2
[2020-12-04T07:37:29.0151 UTC] [info] [pid:84208] Downloading sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2, request #6176 sent to Cache Service
[2020-12-04T07:37:34.0266 UTC] [info] [pid:84208] Download of sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2 processed:
[info] [#6176] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#6176] Downloading "sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" from "http://openqa.suse.de/tests/5128990/asset/hdd/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2"
[info] [#6176] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2" failed: 404 Not Found

[2020-12-04T07:37:34.0267 UTC] [error] [pid:84208] Failed to download sle-15-SP3-aarch64-95.1@aarch64-virtio-minimal_with_sdk95.1_installed-uefi-vars.qcow2

#9 Updated by mkittler 4 months ago

Actually, we have special handling for UEFI_PFLASH_VARS assets. So this behavior is explicitly enabled by our code:

        if ($this_asset eq 'UEFI_PFLASH_VARS' && !defined $asset) {
            log_error("Failed to download $asset_uri", channels => 'autoinst');
            # assume that if we have a full path, that's what we should use
            $vars->{$this_asset} = $asset_uri if -e $asset_uri;
            # don't kill the job if the asset is not found
            # TODO: This seems to leave the job stuck in some cases (observed in production on openqaworker3).
            next;
        }

Maybe this code path should be harmonized with what we already have for the local asset detection:

next if (($key eq 'UEFI_PFLASH_VARS') and !$vars->{UEFI});

So we would already fail here on the worker level unless UEFI is really not set.

Also available in: Atom PDF