action #99153
Status: closed
[Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24
Description
Observation
There are many incomplete jobs on OSD, please see: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1632278812298&to=1632451612298&viewPanel=17
7211384 | offline_sles15sp1_ltss_media_basesys-srv-desk-dev-contm-lgm-py2-wsm_all_full | aarch64 | 2021-09-24 00:07:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7211514 | install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD | s390x-kvm-sle12 | 2021-09-24 00:14:42 | incomplete | backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
7211282 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:30:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7211232 | online_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:31:00 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7212147 | offline_sles12sp5_pscc_sdk-tcm-wsm_all_full:investigate:retry | aarch64 | 2021-09-24 00:32:15 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7212048 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 00:48:34 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7207535 | qam-yast_self_update+15 | uefi | 2021-09-24 01:12:50 | incomplete | cache failure: Cache service queue already full (10)
7208023 | mru-install-multipath-remote_supportserver | 64bit | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
7208045 | qam-textmode+sle15 | 64bit | 2021-09-24 01:12:51 | incomplete | cache failure: Cache service queue already full (10)
7207737 | create_hdd_minimal_base+sdk+python2 | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208073 | lvm_thin_provisioning | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208237 | sle-15-SP3_image_on_sle-12-SP5_host_docker | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7207741 | mru-install-desktop-with-addons | 64bit | 2021-09-24 01:12:52 | incomplete | cache failure: Cache service queue already full (10)
7208022 | mru-install-minimal-with-addons-multipath | 64bit | 2021-09-24 01:12:54 | incomplete | cache failure: Cache service queue already full (10)
7208289 | yast_no_self_update | 64bit | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
7208232 | sle-15-SP3_image_on_sle-15-SP3_host_docker | 64bit | 2021-09-24 01:13:00 | incomplete | cache failure: Cache service queue already full (10)
7207758 | qam-gnome | 64bit | 2021-09-24 01:13:01 | incomplete | cache failure: Cache service queue already full (10)
... ...
7213920 | online_sles15sp1_ltss_pscc_base_all_minimal_zypp | 64bit_cirrus | 2021-09-24 01:41:16 | incomplete | cache failure: Cache service queue already full (10)
7208973 | qam_ha_qdevice_node2 | 64bit | 2021-09-24 01:41:23 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209390 | qam_3nodes_node01 | 64bit | 2021-09-24 01:43:42 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209465 | mau-webserver | 64bit | 2021-09-24 01:43:45 | incomplete | cache failure: Cache service queue already full (10)
7213653 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:44:31 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
7209000 | qam_ha_priority_fencing_node01 | 64bit | 2021-09-24 01:46:55 | incomplete | backend died: QEMU exited unexpectedly, see log for details
7209381 | qam_ha_priority_fencing_node02 | 64bit | 2021-09-24 01:48:18 | incomplete | cache failure: Cache service queue already full (10)
7211405 | offline_sles15sp1_ltss_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 01:49:28 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213816 | qam-gnome | s390x-kvm-sle12 | 2021-09-24 01:51:14 | incomplete | backend died: Error connecting to VNC server <10.161.145.80:5901>: IO::Socket::INET: connect: Connection timed out
7213652 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 01:58:28 | incomplete | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
7212213 | offline_sles15sp2_pscc_lp-basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 02:02:46 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213165 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 02:07:04 | incomplete | backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out
7213815 | qam-minimal+base | s390x-kvm-sle12 | 2021-09-24 02:07:06 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
7213167 | mru-install-minimal-with-addons | s390x-kvm-sle12 | 2021-09-24 02:13:49 | incomplete | backend died: Error connecting to VNC server <10.161.145.91:5901>: IO::Socket::INET: connect: Connection timed out
7212153 | online_sles15sp3_pscc_lp-basesys-srv-desk-dev-contm-lgm-tsm-wsm_all_full:investigate:retry | aarch64 | 2021-09-24 02:14:56 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213137 | qam-gnome | s390x-kvm-sle15 | 2021-09-24 02:48:58 | incomplete | backend died: Error connecting to VNC server <10.161.145.90:5901>: IO::Socket::INET: connect: Connection timed out
7212197 | online_sles15sp2_pscc_basesys-srv-desk-dev-contm-lgm-py2-tsm-wsm_all_full | aarch64 | 2021-09-24 02:54:19 | incomplete | backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 266.
7213903 | ext4_staging_s390x | s390x-kvm-sle12 | 2021-09-24 02:58:03 | incomplete | backend died: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
Checked some jobs with "backend died: QEMU exited unexpectedly, see log for details"; these jobs' autoinst-log.txt shows:
[2021-09-24T03:41:22.180 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-24T03:41:22.180 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-24T03:41:22.181 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-24T03:41:22.248 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-24T03:41:22.248 CEST] [debug] The built-in video encoder (pid 59450) terminated
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-24T03:41:22.250 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
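The "Image is not in qcow2 format" error means QEMU found something other than a qcow2 header in the backing file. As a minimal sketch (a hypothetical pre-flight helper, not part of os-autoinst), such a corrupted or raw backing file can be detected by checking the qcow2 magic bytes; a real check would run "qemu-img info" on the full image:

```shell
# A valid qcow2 image starts with the 4-byte magic 0x51 0x46 0x49 0xfb ("QFI\xfb").
# is_qcow2 is an illustrative helper: it only inspects the header.
is_qcow2() {
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "514649fb" ]
}

# Illustration with fake files: a raw image uploaded under a .qcow2 name
# (as suspected here) would fail this test before QEMU is even started.
printf '\x51\x46\x49\xfb' > /tmp/looks-like-qcow2.img
printf 'raw data'         > /tmp/not-qcow2.img
is_qcow2 /tmp/looks-like-qcow2.img && echo "qcow2 magic present"
is_qcow2 /tmp/not-qcow2.img        || echo "not a qcow2 image"
```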
Suggestions
- Not related to #98901
- qemu says "Could not open backing file: Image is not in qcow2 format"
- Check what recent changes wrt qemu use could have caused this
- Verify if we broke qemu 4.2.1 by supporting 6.0
- Consider the relation to #98727
- Add automatic restarting for known non-critical issues, assuming this issue is flaky
- Ensure that relevant, i.e. "most", scenarios are covered and that incomplete jobs (also past ones) are handled, e.g. retriggered and fixed
- Unpause alerts
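The suggested automatic restarting for known non-critical issues could be sketched as follows. This is a hypothetical helper for illustration only; the regex list is an assumption, and in practice this is handled by openqa-label-known-issues and auto-review rules:

```shell
# Known-flaky reason patterns (illustrative, based on the reasons seen in
# this ticket); jobs matching them are safe to restart automatically.
known_flaky='^(api failure: Failed to register .* 503|cache failure: Cache service queue already full)'

restart_if_flaky() {
    job_id=$1; reason=$2
    if printf '%s' "$reason" | grep -qE "$known_flaky"; then
        # A real restart would call: openqa-cli api -X POST "jobs/$job_id/restart"
        echo "would restart job $job_id"
    else
        echo "leaving job $job_id for manual review"
    fi
}

restart_if_flaky 7207535 'cache failure: Cache service queue already full (10)'
restart_if_flaky 7211384 'backend died: Migrate to file failed'
```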
Rollback steps
- Unpause alert and verify that it passes
Updated by livdywan about 3 years ago
osd-deployment / monitor-post-deploy failed because of Incomplete jobs (not restarted) of last 24h alert
Updated by okurz about 3 years ago
cdywan wrote:
Maybe decreased network performance due to bigger images reported in #99246 slowed these down. There's also incompletes on malbec #99117.
Very unlikely. That regression dates from the beginning of August.
Also, the alert reported "[OK] Incomplete jobs (not restarted) of last 24h alert" at 16:18 CEST today, which means that some incompletes are now older than 24h.
Updated by livdywan about 3 years ago
- Subject changed from [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 to [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 size:
- Description updated (diff)
- Status changed from New to Workable
I added some suggestions, but since there seem to be several different issues I'll try to split this up into dedicated tickets.
Updated by osukup about 3 years ago
[2021-09-24T03:41:22.250 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
After some checks it looks like the backing file from the parent job was damaged during conversion or upload :(
Updated by mkittler about 3 years ago
The alert is already off again. It looks like a mix of everything, and incompletes happened across all workers, e.g. see the output of select id, t_finished, result, substring(reason from 0 for 30) as short_reason, (select host from workers where id = assigned_worker_id) as worker from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and clone_id is null order by t_finished;
However, it looks like most of the incompletes were asset download failures, followed by "backend died: Error connecting to VNC server", followed by various backend errors, followed by API failures. Many of the API failures were 503s, so the web UI was briefly unresponsive/overloaded. That would also explain the failing asset downloads.
openqa=> select count(id), substring(reason from 0 for 35) as short_reason from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and clone_id is null group by short_reason order by count desc;
count | short_reason
-------+------------------------------------
215 | asset failure: Failed to download
184 | backend died: Error connecting to
100 | backend died: QEMU exited unexpect
42 | api failure: Failed to register at
18 | died: terminated prematurely, see
8 | tests died: unable to load main.pm
4 | isotovideo died: Could not find 'c
4 | backend died: QMP command migrate
3 | isotovideo died: Could not find '2
2 | tests died: unable to load tests/i
2 | backend died: Lost SSH connection
1 | backend died: xfreerdp did not sta
1 | isotovideo died: Could not find '6
1 | backend died: backend method not i
1 | backend died: QEMU terminated befo
1 | backend died: Bizarre copy of HASH
If I restrict the query to the period of the biggest increase (from 200 to 300 incompletes in the last 24 hours) it looks like this:
openqa=> select count(id), substring(reason from 0 for 57) as short_reason, array_agg(id) from jobs where result = 'incomplete' and t_finished >= '2021-09-23T18:00:00' and t_finished <= '2021-09-23T21:00:00' and clone_id is null group by short_reason order by count desc;
count | short_reason | array_agg
-------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
42 | api failure: Failed to register at openqa.suse.de - 503 | {7204353,7207323,7201540,7207160,7204230,7207707,7203564,7209676,7203536,7201509,7203540,7203764,7209607,7207133,7209699,7204370,7203544,7204351,7209934,7204372,7201487,7203543,7203998,7203561,7201512,7201542,7201489,7204373,7204371,7203542,7203444
,7203694,7204368,7204369,7203537,7203541,7209649,7203562,7203545,7201527,7204354,7204350}
39 | backend died: QEMU exited unexpectedly, see log for deta | {7204636,7204816,7204510,7204522,7204523,7204528,7204530,7204578,7204579,7204595,7204602,7204804,7204486,7209393,7204577,7204594,7204508,7204516,7204517,7204511,7204515,7204498,7204513,7204507,7204512,7204549,7204817,7204532,7204525,7204593,7204779
,7204589,7204576,7204590,7204616,7204637,7204638,7204806,7204235}
7 | backend died: Error connecting to VNC server <10.161.145 | {7202043,7201998,7202037,7202008,7202019,7202044,7210830}
4 | asset failure: Failed to download sle-15-SP3-s390x-ha-al | {7201958,7201980,7201981,7201959}
2 | backend died: QMP command migrate failed: GenericError; | {7203901,7204635}
1 | asset failure: Failed to download SLE-12-SP3-Server-DVD- | {7210400}
1 | tests died: unable to load main.pm, check the log for th | {7205150}
So I would say the immediate cause was API failures because the web UI was not responsive for too long.
Updated by okurz about 3 years ago
Good investigation. So what do you suggest to do so that reviewers don't need to manually retrigger?
Updated by mkittler about 3 years ago
By the way, here is an example job caused by the web UI being unresponsive (reason is "api failure: …"):
[2021-09-23T20:24:23.0994 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:23.0995 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 502 response: Proxy Error
[2021-09-23T20:24:23.0995 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:24.0927 CEST] [warn] [pid:43911] Websocket connection to http://openqa.suse.de/api/v1/ws/1275 finished by remote side with code 1006, no reason - trying again in 10 seconds
[2021-09-23T20:24:29.0018 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:29.0018 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:29.0019 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:34.0036 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:34.0037 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:34.0037 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:34.0937 CEST] [info] [pid:43911] Registering with openQA openqa.suse.de
[2021-09-23T20:24:34.0944 CEST] [warn] [pid:43911] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.43 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
- trying again in 10 seconds
[2021-09-23T20:24:39.0096 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:39.0096 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:39.0097 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:44.0119 CEST] [error] [pid:59371] Failed uploading asset
[2021-09-23T20:24:44.0119 CEST] [error] [pid:59371] Error uploading upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: 503 response: Service Unavailable
[2021-09-23T20:24:44.0120 CEST] [error] [pid:59371] Upload failed for chunk 631
[2021-09-23T20:24:44.0951 CEST] [info] [pid:43911] Registering with openQA openqa.suse.de
[2021-09-23T20:24:44.0958 CEST] [warn] [pid:43911] Failed to register at openqa.suse.de - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.43 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
- trying again in 10 seconds
[2021-09-23T20:24:49.0120 CEST] [info] [pid:59371] upgrade_from-15-SP3-sle-15-SP4-x86_64-Build38.1-alpha-alpha-node01-HA-BV-offline_scc-yaml.qcow2: Processing chunk 631/4310, avg. speed ~37.560 KiB/s
[2021-09-23T20:24:49.0121 CEST] [error] [pid:59371] Upload failed, and all retry attempts have been exhausted
[2021-09-23T20:24:49.0603 CEST] [info] [pid:59371] Uploading video.ogv
Here the web UI is clearly overloaded: certain asset uploads fail, then the worker re-registration fails, but it was still possible to upload the log and reason in the end.
I also checked a bunch of "backend died: QEMU exited unexpectedly, see log …" jobs and they all looked like this:
[2021-09-23T20:33:47.508 CEST] [warn] !!! : qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on: Could not open backing file: Image is not in qcow2 format
As @osukup already mentioned, the qcow2 image is possibly damaged. Maybe that's related to the web UI being overloaded?
Judging by the GRU logs, the asset cleanup was running at the same time, so the load on the assets drive was certainly high. Unfortunately, logs of the main openQA service aren't available anymore.
Updated by okurz about 3 years ago
- Category set to Regressions/Crashes
okurz wrote:
Good investigation. So what do you suggest to do so that reviewers don't need to manually retrigger?
So in general I think our approach should be retrying in all the cases where it could help. In the end users would not do anything different but retrigger jobs, right?
Updated by mkittler about 3 years ago
- Related to action #99396: Incompletes with auto_review:"api failure: Failed to register .* 503":retry should be restarted automatically added
Updated by mkittler about 3 years ago
- Related to action #99399: openQA (or auto-review) can restart the parent job in case of certain incompletes added
Updated by okurz about 3 years ago
- Subject changed from [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24 size: to [Alerting] Incomplete jobs (not restarted) of last 24h alert on 2021-9-24
- Description updated (diff)
- Status changed from Workable to In Progress
Paused alerts, extended suggestions
Updated by mkittler about 3 years ago
Updated by okurz about 3 years ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/591 to exclude "tests died" incompletes.
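Illustrative only: the effect of excluding "tests died" reasons from the incomplete-job count that feeds the alert can be sketched like this (the actual change is a query adjustment in the salt-states-openqa merge request above, not this filter):

```shell
# Sample reasons taken from this ticket; the alert should only count the
# ones the team needs to act on, i.e. everything except "tests died".
reasons='backend died: QEMU exited unexpectedly, see log for details
tests died: unable to load main.pm, check the log for the error message
cache failure: Cache service queue already full (10)'

# grep -c -v counts the lines NOT matching the pattern; prints 2 here.
printf '%s\n' "$reasons" | grep -cv '^tests died'
```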
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
I've been running ./openqa-monitor-investigation-candidates and ./openqa-label-known-issues to restart the incompletes for #99396. I've checked a few job IDs and they have indeed been restarted.
Currently there are still too many incompletes, so the alert fired again but has been paused. The number of incompletes increased again shortly after 2021-09-27T20:00:00. However, almost all of them are because tests died and indeed look irrelevant to us (a YAML file is missing). So the alert should turn off once https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/591 is merged. Then I can resume the alert again.
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
there you go: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1&from=1632850722713&to=1632854049029
I resumed the alert and it's fine. As we have identified enough follow-up tasks, we are good to call this resolved.