action #99153

closed

[Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24

Added by Xiaojing_liu over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-09-24
Due date:
% Done:

0%

Estimated time:

Description

Observation

There are many incomplete jobs on OSD. See: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1632278812298&to=1632451612298&viewPanel=17

 7211384 | offline_sles15sp1_ltss_media_basesys-srv-desk-dev-contm-lgm-py2-wsm_all_full               | aarch64         | 2021-09-24 00:07:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7211514 | install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD                                           | s390x-kvm-sle12 | 2021-09-24 00:14:42 | incomplete | backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
 7211282 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full                  | aarch64         | 2021-09-24 00:30:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7211232 | online_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full             | aarch64         | 2021-09-24 00:31:00 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7212147 | offline_sles12sp5_pscc_sdk-tcm-wsm_all_full:investigate:retry                              | aarch64         | 2021-09-24 00:32:15 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7212048 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full              | aarch64         | 2021-09-24 00:48:34 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7207535 | qam-yast_self_update+15                                                                    | uefi            | 2021-09-24 01:12:50 | incomplete | cache failure: Cache service queue already full (10)
 7208023 | mru-install-multipath-remote_supportserver                                                 | 64bit           | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
 7208045 | qam-textmode+sle15                                                                         | 64bit           | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
 7207737 | create_hdd_minimal_base+sdk+python2                                                        | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208073 | lvm_thin_provisioning                                                                      | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208237 | sle-15-SP3_image_on_sle-12-SP5_host_docker                                                 | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7207741 | mru-install-desktop-with-addons                                                            | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208022 | mru-install-minimal-with-addons-multipath                                                  | 64bit           | 2021-09-24 01:12:54 | incomplete | cache failure: Cache service queue already full (10)
 7208289 | yast_no_self_update                                                                        | 64bit           | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
 7208232 | sle-15-SP3_image_on_sle-15-SP3_host_docker                                                 | 64bit           | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
 7207758 | qam-gnome                                                                                  | 64bit           | 2021-09-24 01:13:01 | incomplete | c
... ...
 7213920 | online_sles15sp1_ltss_pscc_base_all_minimal_zypp                                           | 64bit_cirrus    | 2021-09-24 01:41:16 | incomplete | cache failure: Cache service queue already full (10)
 7208973 | qam_ha_qdevice_node2                                                                       | 64bit           | 2021-09-24 01:41:23 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209390 | qam_3nodes_node01                                                                          | 64bit           | 2021-09-24 01:43:42 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209465 | mau-webserver                                                                              | 64bit           | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
 7213653 | qam-gnome                                                                                  | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
 7209000 | qam_ha_priority_fencing_node01                                                             | 64bit           | 2021-09-24 01:46:55 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209381 | qam_ha_priority_fencing_node02                                                             | 64bit           | 2021-09-24 01:48:18 | incomplete | cache failure: Cache service queue already full (10)
 7211405 | offline_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full            | aarch64         | 2021-09-24 01:49:28 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213816 | qam-gnome                                                                                  | s390x-kvm-sle12 | 2021-09-24 01:51:14 | incomplete | backend died: Error connecting to VNC server <10.161.145.80:5901>: IO::Socket::INET: connect: Connection timed out
 7213652 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 01:58:28 | incomplete | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
 7212213 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full              | aarch64         | 2021-09-24 02:02:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213165 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 02:07:04 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
 7213815 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 02:07:06 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
 7213167 | mru-install-minimal-with-addons                                                            | s390x-kvm-sle12 | 2021-09-24 02:13:49 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out
 7212153 | online_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-tsm-wsm_all_full:investigate:retry | aarch64         | 2021-09-24 02:14:56 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213137 | qam-gnome                                                                                  | s390x-kvm-sle15 | 2021-09-24 02:48:58 | incomplete | backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out
 7212197 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full                  | aarch64         | 2021-09-24 02:54:19 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213903 | ext4_staging_s390x                                                                         | s390x-kvm-sle12 | 2021-09-24 02:58:03 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
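To get a quick overview of which failure types dominate a listing like the one above, the rows can be grouped by the text before the first colon in the reason column. This is only a rough sketch: the function name `summarize_reasons` is made up for illustration, and it assumes the six pipe-separated columns shown above (id, test, machine, time, result, reason).

```python
from collections import Counter


def summarize_reasons(rows: str) -> Counter:
    """Count jobs per failure category, e.g. 'backend died' or 'cache failure'."""
    counts = Counter()
    for line in rows.splitlines():
        # Split into at most six fields so a '|' inside the reason is preserved.
        fields = [field.strip() for field in line.split("|", 5)]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip blank, truncated, or non-row lines such as '... ...'
        reason = fields[5]
        category = reason.split(":", 1)[0]
        counts[category] += 1
    return counts


# Three rows copied from the listing above, with column padding trimmed.
sample = """\
 7209390 | qam_3nodes_node01 | 64bit | 2021-09-24 01:43:42 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209465 | mau-webserver | 64bit | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
 7213653 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
... ...
"""

summary = summarize_reasons(sample)
for category, count in summary.most_common():
    print(f"{count:3d}  {category}")
```

Grouping only on the prefix is deliberately coarse. Splitting "backend died" further (migrate timeout, VNC timeout, QEMU exit) would need patterns for the individual messages.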

I checked some of the jobs that failed with "backend died: QEMU exited unexpectedly, see log for details". The autoinst-log.txt of these jobs shows:

[2021-09-24T03:41:22.180 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-24T03:41:22.180 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-24T03:41:22.181 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-24T03:41:22.248 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-24T03:41:22.248 CEST] [debug] The built-in video encoder (pid 59450) terminated
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
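
As a quick sanity check for the "Image is not in qcow2 format" error, one could look at the first four bytes of the suspected backing file, since qcow2 images start with the magic bytes `QFI\xfb`. The sketch below is only that: the helper names are hypothetical, the path of the affected asset is not known from the log and has to be supplied, and a matching magic does not prove the image is intact (`qemu-img info` gives a more complete answer).

```python
QCOW2_MAGIC = b"QFI\xfb"  # first four bytes of every qcow2 image header


def has_qcow2_magic(header: bytes) -> bool:
    """Return True if the given header bytes start with the qcow2 magic."""
    return header[:4] == QCOW2_MAGIC


def looks_like_qcow2(path: str) -> bool:
    """Read the first four bytes of a file and compare them to the qcow2 magic."""
    with open(path, "rb") as image:
        return has_qcow2_magic(image.read(4))
```

An empty or truncated file, for example from an interrupted download, would fail this check, but so would a valid raw image. The result only narrows down where to look.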

Suggestions

  • Not related to #98901
  • qemu says "Could not open backing file: Image is not in qcow2 format"
  • Check what recent changes wrt qemu use could have caused this
  • Verify if we broke qemu 4.2.1 by supporting 6.0
  • Consider the relation to #98727
  • Add automatic restarting for known non-critical issues, assuming this issue is flaky
  • Ensure that relevant, i.e. "most", scenarios are handled and incomplete jobs (also in the past) are handled, e.g. retriggered and fixed
  • Unpause alerts
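
For the automatic restarting suggestion, a minimal sketch of the decision step could match the incomplete reason against a list of known non-critical patterns. Which reasons are actually safe to restart is an assumption here: the two patterns below are taken from the listing in the observation only as examples, and `RESTARTABLE_PATTERNS` and `should_restart` are hypothetical names, not existing openQA code.

```python
import re

# Example patterns only; the real list would need to be agreed on per issue.
RESTARTABLE_PATTERNS = [
    re.compile(r"^cache failure: Cache service queue already full"),
    re.compile(r"^backend died: Error connecting to VNC server"),
]


def should_restart(reason: str) -> bool:
    """Return True if the incomplete reason matches a known non-critical pattern."""
    return any(pattern.search(reason) for pattern in RESTARTABLE_PATTERNS)


print(should_restart("cache failure: Cache service queue already full (10)"))
print(should_restart("backend died: QEMU exited unexpectedly, see log for details"))
```

Reasons that are not listed, such as the QEMU exit above, are left alone so that a real regression is not hidden by restarts.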

Rollback steps

  • Unpause alert and verify that it passes

Related issues 2 (1 open, 1 closed)

Related to openQA Project - action #99396: Incompletes with auto_review:"api failure: Failed to register .* 503":retry should be restarted automatically (Resolved, okurz, 2021-09-28, 2021-10-12)
Related to openQA Project - action #99399: openQA (or auto-review) can restart the parent job in case of certain incompletes (New, 2021-09-28)
