action #99153

closed

[Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24

Added by Xiaojing_liu about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

There are many incomplete jobs on OSD, please see: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1632278812298&to=1632451612298&viewPanel=17

 7211384 | offline_sles15sp1_ltss_media_basesys-srv-desk-dev-contm-lgm-py2-wsm_all_full               | aarch64         | 2021-09-24 00:07:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7211514 | install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD                                           | s390x-kvm-sle12 | 2021-09-24 00:14:42 | incomplete | backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
 7211282 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full                  | aarch64         | 2021-09-24 00:30:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7211232 | online_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full             | aarch64         | 2021-09-24 00:31:00 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7212147 | offline_sles12sp5_pscc_sdk-tcm-wsm_all_full:investigate:retry                              | aarch64         | 2021-09-24 00:32:15 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7212048 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full              | aarch64         | 2021-09-24 00:48:34 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7207535 | qam-yast_self_update+15                                                                    | uefi            | 2021-09-24 01:12:50 | incomplete | cache failure: Cache service queue already full (10)
 7208023 | mru-install-multipath-remote_supportserver                                                 | 64bit           | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
 7208045 | qam-textmode+sle15                                                                         | 64bit           | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
 7207737 | create_hdd_minimal_base+sdk+python2                                                        | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208073 | lvm_thin_provisioning                                                                      | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208237 | sle-15-SP3_image_on_sle-12-SP5_host_docker                                                 | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7207741 | mru-install-desktop-with-addons                                                            | 64bit           | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
 7208022 | mru-install-minimal-with-addons-multipath                                                  | 64bit           | 2021-09-24 01:12:54 | incomplete | cache failure: Cache service queue already full (10)
 7208289 | yast_no_self_update                                                                        | 64bit           | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
 7208232 | sle-15-SP3_image_on_sle-15-SP3_host_docker                                                 | 64bit           | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
 7207758 | qam-gnome                                                                                  | 64bit           | 2021-09-24 01:13:01 | incomplete | c
... ...
 7213920 | online_sles15sp1_ltss_pscc_base_all_minimal_zypp                                           | 64bit_cirrus    | 2021-09-24 01:41:16 | incomplete | cache failure: Cache service queue already full (10)
 7208973 | qam_ha_qdevice_node2                                                                       | 64bit           | 2021-09-24 01:41:23 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209390 | qam_3nodes_node01                                                                          | 64bit           | 2021-09-24 01:43:42 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209465 | mau-webserver                                                                              | 64bit           | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
 7213653 | qam-gnome                                                                                  | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
 7209000 | qam_ha_priority_fencing_node01                                                             | 64bit           | 2021-09-24 01:46:55 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 7209381 | qam_ha_priority_fencing_node02                                                             | 64bit           | 2021-09-24 01:48:18 | incomplete | cache failure: Cache service queue already full (10)
 7211405 | offline_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full            | aarch64         | 2021-09-24 01:49:28 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213816 | qam-gnome                                                                                  | s390x-kvm-sle12 | 2021-09-24 01:51:14 | incomplete | backend died: Error connecting to VNC server <10.161.145.80:5901>: IO::Socket::INET: connect: Connection timed out
 7213652 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 01:58:28 | incomplete | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
 7212213 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full              | aarch64         | 2021-09-24 02:02:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213165 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 02:07:04 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
 7213815 | qam-minimal+base                                                                           | s390x-kvm-sle12 | 2021-09-24 02:07:06 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
 7213167 | mru-install-minimal-with-addons                                                            | s390x-kvm-sle12 | 2021-09-24 02:13:49 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out
 7212153 | online_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-tsm-wsm_all_full:investigate:retry | aarch64         | 2021-09-24 02:14:56 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213137 | qam-gnome                                                                                  | s390x-kvm-sle15 | 2021-09-24 02:48:58 | incomplete | backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out
 7212197 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full                  | aarch64         | 2021-09-24 02:54:19 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
 7213903 | ext4_staging_s390x                                                                         | s390x-kvm-sle12 | 2021-09-24 02:58:03 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out

Checking some jobs with "backend died: QEMU exited unexpectedly, see log for details", their autoinst-log.txt shows:

[2021-09-24T03:41:22.180 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-24T03:41:22.180 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-24T03:41:22.181 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-24T03:41:22.248 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-24T03:41:22.248 CEST] [debug] The built-in video encoder (pid 59450) terminated
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
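
To find other recent jobs that failed with the same qemu error, something like the following could be run on the web UI host (the testresults path is the default location; a sketch only):

# List job logs containing the same backing-file error
grep -rl --include=autoinst-log.txt 'Image is not in qcow2 format' /var/lib/openqa/testresults/ | head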

Suggestions

  • Not related to #98901
  • qemu says Could not open backing file: Image is not in qcow2 format
  • Check what recent changes wrt qemu use could have caused this
  • Verify if we broke qemu 4.2.1 by supporting 6.0
  • Consider the relation to #98727
  • Add automatic restarting for known non-critical issues, assuming this issue is flaky (see the sketch after this list)
  • Ensure that the relevant, i.e. "most", scenarios are covered and that incomplete jobs (also past ones) are handled, e.g. retriggered and fixed
  • Unpause alerts
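
A minimal sketch of what such automatic restarting could look like, using openqa-cli and a hand-picked list of affected job IDs (the IDs are examples taken from the table above; this is a sketch, not the mechanism that was actually implemented):

# Restart known non-critical incompletes via the API; assumes an API key for openqa.suse.de is configured
for id in 7211384 7211514 7211282; do
    openqa-cli api --osd -X POST "jobs/$id/restart"
done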

Rollback steps

  • Unpause alert and verify that it passes

Related issues 2 (1 open, 1 closed)

Related to openQA Project (public) - action #99396: Incompletes with auto_review:"api failure: Failed to register .* 503":retry should be restarted automatically (Resolved, okurz, 2021-09-28 - 2021-10-12)

Related to openQA Project (public) - action #99399: openQA (or auto-review) can restart the parent job in case of certain incompletes (New, 2021-09-28)

Actions #1

Updated by livdywan about 3 years ago

osd-deployment / monitor-post-deploy failed because of Incomplete jobs (not restarted) of last 24h alert

Actions #2

Updated by livdywan about 3 years ago

Maybe decreased network performance due to bigger images reported in #99246 slowed these down. There are also incompletes on malbec (#99117).

Also, [OK] Incomplete jobs (not restarted) of last 24h alert at 16:18 CEST today

Actions #3

Updated by okurz about 3 years ago

cdywan wrote:

Maybe decreased network performance due to bigger images reported in #99246 slowed these down. There are also incompletes on malbec (#99117).

Very unlikely. The regression was from the beginning of August.

Also, [OK] Incomplete jobs (not restarted) of last 24h alert at 16:18 CEST today

This means that some incompletes are now older than 24h.

Actions #4

Updated by okurz about 3 years ago

  • Target version set to Ready
Actions #5

Updated by livdywan about 3 years ago

  • Subject changed from [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 to [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 size:
  • Description updated (diff)
  • Status changed from New to Workable

I added some suggestions, but since there seem to be different issues I'll try and split this up into dedicated tickets

Actions #6

Updated by osukup about 3 years ago

[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format

After some checks it looks like the backing file from the parent job was damaged during conversion or upload :(
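
If that suspicion comes up again, the cached backing file could be inspected directly on the affected worker, e.g. (the file name below is hypothetical; a sketch only):

# Check whether the cached backing image is really a valid qcow2 file
file /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-x86_64-example.qcow2
qemu-img info /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-x86_64-example.qcow2
qemu-img check /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-x86_64-example.qcow2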

Actions #7

Updated by mkittler about 3 years ago

  • Assignee set to mkittler
Actions #8

Updated by mkittler about 3 years ago

The alert is already off again. It looks like a mix of everything, and incompletes happened across all workers, e.g. see the output of select id, t_finished, result, substring(reason from 0 for 30) as short_reason, (select host from workers where id = assigned_worker_id) as worker from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and clone_id is null order by t_finished;.

However, it looks like most of the incompletes were asset download failures, followed by "backend died: Error connecting to VNC server", followed by various backend errors, followed by API failures. Many of the API failures were 503s, so the web UI was briefly unresponsive/overloaded. That would also explain the failing asset downloads.

openqa=> select count(id), substring(reason from 0 for 35) as short_reason from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and clone_id is null group by short_reason order by count desc;
 count |            short_reason            
-------+------------------------------------
   215 | asset failure: Failed to download 
   184 | backend died: Error connecting to 
   100 | backend died: QEMU exited unexpect
    42 | api failure: Failed to register at
    18 | died: terminated prematurely, see 
     8 | tests died: unable to load main.pm
     4 | isotovideo died: Could not find 'c
     4 | backend died: QMP command migrate 
     3 | isotovideo died: Could not find '2
     2 | tests died: unable to load tests/i
     2 | backend died: Lost SSH connection 
     1 | backend died: xfreerdp did not sta
     1 | isotovideo died: Could not find '6
     1 | backend died: backend method not i
     1 | backend died: QEMU terminated befo
     1 | backend died: Bizarre copy of HASH

If I restrict the query to the period of the biggest increase (from 200 to 300 incompletes within the last 24 hours), it looks like this:

openqa=> select count(id), substring(reason from 0 for 57) as short_reason, array_agg(id) from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and t_finished <= '2021-09-23T21:00:00' and clone_id is null group by short_reason order by count desc;
 count |                       short_reason                       | array_agg
-------+-----------------------------------------------------------+-----------
    42 | api failure: Failed to register at openqa.suse.de - 503  | {7204353,7207323,7201540,7207160,7204230,7207707,7203564,7209676,7203536,7201509,7203540,7203764,7209607,7207133,7209699,7204370,7203544,7204351,7209934,7204372,7201487,7203543,7203998,7203561,7201512,7201542,7201489,7204373,7204371,7203542,7203444,7203694,7204368,7204369,7203537,7203541,7209649,7203562,7203545,7201527,7204354,7204350}
    39 | backend died: QEMU exited unexpectedly, see log for deta | {7204636,7204816,7204510,7204522,7204523,7204528,7204530,7204578,7204579,7204595,7204602,7204804,7204486,7209393,7204577,7204594,7204508,7204516,7204517,7204511,7204515,7204498,7204513,7204507,7204512,7204549,7204817,7204532,7204525,7204593,7204779,7204589,7204576,7204590,7204616,7204637,7204638,7204806,7204235}
     7 | backend died: Error connecting to VNC server <10.161.145 | {7202043,7201998,7202037,7202008,7202019,7202044,7210830}
     4 | asset failure: Failed to download sle-15-SP3-s390x-ha-al | {7201958,7201980,7201981,7201959}
     2 | backend died: QMP command migrate failed: GenericError;  | {7203901,7204635}
     1 | asset failure: Failed to download SLE-12-SP3-Server-DVD- | {7210400}
     1 | tests died: unable to load main.pm, check the log for th | {7205150} 

So I would say the immediate cause was API failures because the web UI was unresponsive for too long.
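
To confirm the timing of the spike directly from the database, a follow-up query could look like this (a sketch, assuming local psql access to the openqa database; columns as in the queries above):

# Count incompletes per hour around the incident to correlate them with the web UI overload
psql openqa -c "
  select date_trunc('hour', t_finished) as hour, count(id)
    from jobs
   where result = 'incomplete'
     and t_finished >= '2021-09-23T18:00:00'
     and clone_id is null
   group by hour
   order by hour;"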

Actions #9

Updated by okurz about 3 years ago

Good investigation. So what do you suggest we do so that reviewers don't need to retrigger manually?

Actions #10

Updated by mkittler about 3 years ago

By the way, that's an example job caused by the web UI being unresponsive (reason is api failure: …):

[2021-09-23T20:24:23.0994 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:23.0995 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 502 response: Proxy Error
[2021-09-23T20:24:23.0995 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:24.0927 CEST] [warn] [pid:43911] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-23T20:24:29.0018 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:29.0018 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:29.0019 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:34.0036 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:34.0037 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:34.0037 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:34.0937 CEST] [info] [pid:43911] Registering with openQA openqa.suse.de
[2021-09-23T20:24:34.0944 CEST] [warn] [pid:43911] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.43 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
 - trying again in 10 seconds
[2021-09-23T20:24:39.0096 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:39.0096 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:39.0097 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:44.0119 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:44.0119 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:44.0120 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:44.0951 CEST] [info] [pid:43911] Registering with openQA openqa.suse.de
[2021-09-23T20:24:44.0958 CEST] [warn] [pid:43911] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.43 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
 - trying again in 10 seconds
[2021-09-23T20:24:49.0120 CEST] [info] [pid:59371] upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: Processing chunk 631/4310, avg. speed ~37.560 KiB/s
[2021-09-23T20:24:49.0121 CEST] [error] [pid:59371] Upload failed, and all retry attempts have been exhausted
[2021-09-23T20:24:49.0603 CEST] [info] [pid:59371] Uploading video.ogv

Here the web UI clearly looks overloaded: certain asset uploads fail, then the worker re-registration fails as well, but it was still possible to upload the log and the reason.


I also checked a bunch of backend died: QEMU exited unexpectedly, see log … jobs and they all looked like this:

[2021-09-23T20:33:47.508 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format

As @osukup has already mentioned, the qcow2 image is possibly damaged. Maybe that's related to the web UI being overloaded?


Judging by the GRU logs, the asset cleanup was running at the same time, so the load on the assets drive was certainly high. Unfortunately, the logs of the main openQA service aren't available anymore.
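
For reference, the overlap with the asset cleanup could be checked from the GRU journal, e.g. (assuming the default openqa-gru systemd unit; a sketch only):

# Show GRU activity (e.g. asset cleanup) during the window when the incompletes spiked
journalctl -u openqa-gru --since '2021-09-23 18:00' --until '2021-09-23 21:00' | grep -i asset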

Actions #11

Updated by okurz about 3 years ago

  • Category set to Regressions/Crashes

okurz wrote:

Good investigation. So what do you suggest we do so that reviewers don't need to retrigger manually?

So in general I think our approach should be to retry in all cases where it could help. In the end, users would not do anything different than retrigger the jobs, right?

Actions #12

Updated by mkittler about 3 years ago

  • Related to action #99396: Incompletes with auto_review:"api failure: Failed to register .* 503":retry should be restarted automatically added
Actions #13

Updated by mkittler about 3 years ago

  • Related to action #99399: openQA (or auto-review) can restart the parent job in case of certain incompletes added
Actions #14

Updated by okurz about 3 years ago

  • Subject changed from [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 size: to [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24
  • Description updated (diff)
  • Status changed from Workable to In Progress

Paused alerts, extended suggestions

Actions #17

Updated by mkittler about 3 years ago

  • Status changed from In Progress to Feedback

I've been running ./openqa-monitor-investigation-candidates and ./openqa-label-known-issues to restart the incompletes for #99396. I've checked a few job IDs and they have indeed been restarted.
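
For the record, such a run roughly looks like this (the chaining below is an assumption about how these scripts are commonly combined, not a verbatim transcript of what was executed):

# openqa-monitor-investigation-candidates lists candidate jobs which
# openqa-label-known-issues consumes to label them and, for tickets carrying a ":retry" marker, restart them
./openqa-monitor-investigation-candidates | ./openqa-label-known-issues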

Currently there are still too many incompletes, so the alert fired again, but it has been paused. The number of incompletes increased again shortly after '2021-09-27T20:00:00'. However, almost all of them are because tests died and indeed look irrelevant to us (a YAML file is missing). So the alert should turn off when https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/591 is merged. Then I can resume the alert again.

Actions #18

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved

There you go: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1&from=1632850722713&to=1632854049029

I resumed the alert and it's fine. As we have identified enough follow-up tasks, we are good to call this resolved.
