action #99153
Updated by okurz about 3 years ago
## Observation
There are many incomplete jobs on OSD, see: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1632278812298&to=1632451612298&viewPanel=17
```
7211384 | offline_sles15sp1_ltss_media_basesys-srv-desk-dev-contm-lgm-py2-wsm_all_full | aarch64 | 2021-09-24 00:07:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7211514 | install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD | s390x-kvm-sle12 | 2021-09-24 00:14:42 | incomplete | backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
7211282 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:30:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7211232 | online_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:31:00 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7212147 | offline_sles12sp5_pscc_sdk-tcm-wsm_all_full:investigate:retry | aarch64 | 2021-09-24 00:32:15 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7212048 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:48:34 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7207535 | qam-yast_self_update+15 | uefi | 2021-09-24 01:12:50 | incomplete | cache failure: Cache service queue already full (10)
7208023 | mru-install-multipath-remote_supportserver | 64bit | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
7208045 | qam-textmode+sle15 | 64bit | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
7207737 | create_hdd_minimal_base+sdk+python2 | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208073 | lvm_thin_provisioning | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208237 | sle-15-SP3_image_on_sle-12-SP5_host_docker | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7207741 | mru-install-desktop-with-addons | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208022 | mru-install-minimal-with-addons-multipath | 64bit | 2021-09-24 01:12:54 | incomplete | cache failure: Cache service queue already full (10)
7208289 | yast_no_self_update | 64bit | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
7208232 | sle-15-SP3_image_on_sle-15-SP3_host_docker | 64bit | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
7207758 | qam-gnome | 64bit | 2021-09-24 01:13:01 | incomplete | c
... ...
7213920 | online_sles15sp1_ltss_pscc_base_all_minimal_zypp | 64bit_cirrus | 2021-09-24 01:41:16 | incomplete | cache failure: Cache service queue already full (10)
7208973 | qam_ha_qdevice_node2 | 64bit | 2021-09-24 01:41:23 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209390 | qam_3nodes_node01 | 64bit | 2021-09-24 01:43:42 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209465 | mau-webserver | 64bit | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
7213653 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
7209000 | qam_ha_priority_fencing_node01 | 64bit | 2021-09-24 01:46:55 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209381 | qam_ha_priority_fencing_node02 | 64bit | 2021-09-24 01:48:18 | incomplete | cache failure: Cache service queue already full (10)
7211405 | offline_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 01:49:28 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213816 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:51:14 | incomplete | backend died: Error connecting to VNC server <10.161.145.80:5901>: IO::Socket::INET: connect: Connection timed out
7213652 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 01:58:28 | incomplete | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
7212213 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 02:02:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213165 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 02:07:04 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
7213815 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 02:07:06 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
7213167 | mru-install-minimal-with-addons | s390x-kvm-sle12 | 2021-09-24 02:13:49 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out
7212153 | online_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-tsm-wsm_all_full:investigate:retry | aarch64 | 2021-09-24 02:14:56 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213137 | qam-gnome | s390x-kvm-sle15 | 2021-09-24 02:48:58 | incomplete | backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out
7212197 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 02:54:19 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213903 | ext4_staging_s390x | s390x-kvm-sle12 | 2021-09-24 02:58:03 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
```
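To see which failure reasons dominate a listing like the one above, the rows can be tallied by reason. The following is a minimal sketch, assuming the pipe-separated layout shown in the listing; the helper names `normalize_reason` and `tally_reasons` are made up for illustration and are not part of openQA. The sample rows are copied from the listing.

```python
import re
from collections import Counter

# A few rows copied verbatim from the listing above.
ROWS = """\
7208973 | qam_ha_qdevice_node2 | 64bit | 2021-09-24 01:41:23 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209465 | mau-webserver | 64bit | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
7213653 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
7211405 | offline_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 01:49:28 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213816 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:51:14 | incomplete | backend died: Error connecting to VNC server <10.161.145.80:5901>: IO::Socket::INET: connect: Connection timed out
"""


def normalize_reason(reason):
    """Mask the host:port in angle brackets so rows differing only by IP group together."""
    return re.sub(r"<[^>]*>", "<host>", reason.strip())


def tally_reasons(text):
    """Count rows per normalized reason; rows with fewer than six fields are skipped."""
    counts = Counter()
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue
        # The reason is the sixth column; rejoin in case it contains a pipe itself.
        reason = " | ".join(fields[5:])
        counts[normalize_reason(reason)] += 1
    return counts


if __name__ == "__main__":
    for reason, count in tally_reasons(ROWS).most_common():
        print(count, reason)
```

Masking the VNC address is a design choice for this sketch: without it, every s390x worker IP would show up as a separate reason.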
I checked some of the jobs that failed with `backend died: QEMU exited unexpectedly, see log for details`. Their autoinst-log.txt shows:
```
[2021-09-24T03:41:22.180 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-24T03:41:22.180 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-24T03:41:22.181 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-24T03:41:22.248 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-24T03:41:22.248 CEST] [debug] The built-in video encoder (pid 59450) terminated
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
```
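One quick way to check a suspect backing file on a worker is to look at its header: qcow images start with the magic bytes `QFI\xfb`, followed by a big-endian 32-bit version number (2 or 3 for qcow2). The sketch below only sniffs these eight bytes. It is an illustration, not what qemu does internally, and the function names are made up. For a full check, `qemu-img info` on the file is the better tool.

```python
import struct

# Magic bytes at offset 0 of a qcow image: 'Q', 'F', 'I', 0xfb.
QCOW_MAGIC = b"QFI\xfb"


def qcow_version(path):
    """Return the version from the qcow header, or None if the magic is missing."""
    with open(path, "rb") as image:
        header = image.read(8)
    if len(header) < 8 or header[:4] != QCOW_MAGIC:
        return None
    # Bytes 4..7 hold the format version as a big-endian unsigned integer.
    (version,) = struct.unpack(">I", header[4:8])
    return version


def looks_like_qcow2(path):
    """True if the header carries the qcow magic and version 2 or 3."""
    return qcow_version(path) in (2, 3)
```

If `looks_like_qcow2` returns `False` for an asset that is supposed to be a qcow2 image, the file content itself is wrong (for example a raw image or a different file under the same name), which would match the qemu message above.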
## Suggestions
- Not related to #98901
- qemu says `Could not open backing file: Image is not in qcow2 format`
- Check what recent changes wrt qemu use could have caused this
- Verify if we broke qemu 4.2.1 by supporting 6.0
- Consider the relation to #98727
- Add automatic restarting for known non-critical issues, assuming this issue is flaky
- Ensure that the relevant, i.e. "most", scenarios are covered and that incomplete jobs (including past ones) are handled, e.g. retriggered and fixed
- Unpause alerts
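
The automatic-restart suggestion could be sketched as a small decision function that matches a job's incomplete reason against a list of known patterns. This is only an illustration: which reasons count as "known non-critical" is an assumption here, the pattern list and the retry limit are hypothetical, and none of the names below are openQA API. The qcow2 backing file error is deliberately not on the list, since it is not yet clear whether it is flaky.

```python
import re

# Hypothetical list of reasons treated as non-critical. The prefixes are taken
# from the reasons in the job listing above; the selection is an assumption.
RETRIABLE_PATTERNS = [
    re.compile(r"^cache failure: Cache service queue already full"),
    re.compile(r"^backend died: Error connecting to VNC server"),
    re.compile(r"^backend died: Lost SSH connection to SUT"),
]


def should_retrigger(reason, retries_so_far, max_retries=2):
    """Decide whether an incomplete job should be restarted automatically.

    reason: the incomplete reason string of the job
    retries_so_far: how often this job was already restarted automatically
    max_retries: hypothetical upper bound to avoid endless restart loops
    """
    if retries_so_far >= max_retries:
        return False
    return any(pattern.search(reason) for pattern in RETRIABLE_PATTERNS)
```

For example, `should_retrigger("cache failure: Cache service queue already full (10)", 0)` is true, while a job with `backend died: QEMU exited unexpectedly, see log for details` is left for manual investigation.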
## Rollback steps
* Unpause alert and verify that it passes