action #16430: [tools] qcow2 image upload fails with no apparent reason and job ends up incomplete - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #16430

closed

[tools] qcow2 image upload fails with no apparent reason and job ends up incomplete

Added by okurz about 8 years ago. Updated over 7 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

coolo

Category:

Regressions/Crashes

Target version:

Start date:

2017-02-02

Due date:

% Done:

Estimated time:

Description

observation¶

https://openqa.opensuse.org/tests/346076/file/autoinst-log.txt
shows

21:52:08.5284 2867 running nice ionice qemu-img convert -O qcow2 raid/l1 assets_public/opensuse-42.3-x86_64-0063-gnome@64bit.qcow2 -c
p11-kit: couldn't list directory: /etc/pkcs11/modules: Permission denied
Failed to open module: /usr/lib64/qemu/block-iscsi.so: failed to map segment from shared object
Failed to open module: /usr/lib64/qemu/block-ssh.so: failed to map segment from shared object
2867: EXIT 0
+++ worker notes +++
end time: 2017-02-02 21:56:48
result: done
uploading consoletest_setup-loadavg_consoletest_setup.txt
uploading consoletest_setup-psaxf.log
uploading install_and_reboot-y2logs.tar.bz2
uploading autoinst.xml
Checksum :1888125274 Sizes::58699
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

and then nothing.

steps to reproduce¶

A lot of job with same symptoms, e.g. take a look for all incompletes on https://openqa.opensuse.org/tests?hoursfresh=24&match=gnome-image but apparently only on openqaworker1

problem¶

H1. openqaworker1 has problems uploading the images
H2. The other error messages in the log about "permission denied" and such are related to the problem

suggestion¶

proper log messages about this.

workaround¶

maybe retrigger jobs until any other worker than openqaworker1 is hit?

Related issues 5 (0 open — 5 closed)

Actions

Copy link

Updated by okurz about 8 years ago

another example: https://openqa.opensuse.org/tests/348306

Actions

Copy link

Updated by szarate about 8 years ago

Related to action #17218: Can't publish HDD1: cksum: /var/lib/openqa/share/factory/hdd/sle-12-SP3-x86_64-0248-textmode@64bit.qcow2.TEMP-IMZWlhMn: Permission denied added

Actions

Copy link

Updated by szarate about 8 years ago

poo#17218 might be related here

Actions

Copy link

Updated by coolo almost 8 years ago

Status changed from New to Rejected

17218 is not related - and as this is not an often seen problem I close it as random issue in the infrastructure

Actions

Copy link

Updated by michalnowak almost 8 years ago

Related to action #19546: QCOW uploaded as asset won't appear on NFS pool and Assets tab added

Actions

Copy link

Updated by okurz over 7 years ago

Status changed from Rejected to In Progress
Priority changed from High to Low
Target version set to future

https://openqa.suse.de/tests/984049/file/autoinst-log.txt
https://openqa.suse.de/tests/983968/file/autoinst-log.txt

two times in the same build, different arch, both times same test suite "cryptlvm". maybe the disk image is just very big in these cases:

uploading sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2
Checksum comparison (actual:expected) 4294967295:4149226778 with size (actual:expected) 0:5163515904
Checksum/size comparison of sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2 FAILED

uploading sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2
Checksum comparison (actual:expected) 4294967295:194332624 with size (actual:expected) 0:10527047680
Checksum/size comparison of sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2 FAILED

Actions

Copy link

Updated by vsvecova over 7 years ago

Another example:
https://openqa.suse.de/tests/987817/file/autoinst-log.txt

uploading SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:1762304692 with size (actual:expected) 0:3093506560
Checksum/size comparison of SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2 FAILED

Restarting does not help, yields the same result:
https://openqa.suse.de/tests/987993

Yet another:
https://openqa.suse.de/tests/988271/file/autoinst-log.txt

uploading SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:4195181595 with size (actual:expected) 0:3076052992
Checksum/size comparison of SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2 FAILED

Actions

Copy link

Updated by okurz over 7 years ago

Subject changed from qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete
Priority changed from Low to Normal
Target version deleted (~~future~~)

ok, so it's a thing. How about we further improve our logging? Someone has suggestions? E.g. when size is zero, what should we output? Or check logfiles on the server?

Actions

Copy link

Updated by okurz over 7 years ago

Has duplicate action #19670: [tools][sle][functional][aarch64] test failed to upload disk image added

Actions

Copy link

#10

Updated by okurz over 7 years ago

Happened in https://openqa.suse.de/tests/987904 as well (see #19670)

Actions

Copy link

#11

Updated by vsvecova over 7 years ago

Today's example:

https://openqa.suse.de/tests/1011148

uploading SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:3593870896 with size (actual:expected) 0:3106909696
Checksum/size comparison of SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2 FAILED

Actions

Copy link

#12

Updated by okurz over 7 years ago

Has duplicate action #19986: [sled] checksum failed when publishing cryptlvm hdd images added

Actions

Copy link

#13

Updated by dzedro over 7 years ago

Build 0437 failure https://openqa.suse.de/tests/1016254

Actions

Copy link

#14

Updated by riafarov over 7 years ago

Another example here: https://openqa.suse.de/tests/1018503#

Actions

Copy link

#15

Updated by coolo over 7 years ago

Subject changed from [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools] qcow2 image upload fails with no apparent reason and job ends up incomplete

the last examples (also as of today) are all about rewriting existant files. For some reason all the cryptlvm hdds do not have the build number in the filename. The examples in #7 have though

Actions

Copy link

#16

Updated by coolo over 7 years ago

Assignee set to kraih

Debugging https://openqa.suse.de/tests/1024024 ...

From autoinst log:
uploading textinfo-logs.tar.bz2 uploading sle-12-SP3-Desktop-DVD-x86_64-gnome-encrypted.qcow2 Checksum comparison (actual:expected) 4294967295:4170063198 with size (actual:expected) 0:7906116096 Checksum/size comparison of sle-12-SP3-Desktop-DVD-x86_64-gnome-encrypted.qcow2 FAILED uploading video.ogv
The upload happened around 05:09 - last successful log:
[Tue Jun 27 05:09:49 2017] [17003:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit/ulogs textinfo-logs.tar.bz2 [Tue Jun 27 05:09:49 2017] [17003:debug] 200 OK (0.009422s, 106.135/s)

Next successful log:
[Tue Jun 27 05:11:37 2017] [17469:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit video.ogv [Tue Jun 27 05:11:37 2017] [17469:debug] 200 OK (0.01602s, 62.422/s)

In these 2 minutes we saw 1248 POSTs - so it was a quite busy time. But what is worrying me most: we saw 112 worker restarts due to rss limit. So I'm afraid these restarts happen too early - or too late.

This might also be related to the random problems we see on cavil. For reference: this is the restart code: https://github.com/os-autoinst/openQA/pull/1339

Actions

Copy link

#17

Updated by coolo over 7 years ago

I raised the rss limit to 580.000 and restarted the job. Let's see

Actions

Copy link

#18

Updated by coolo over 7 years ago

the upload worked now

Actions

Copy link

#19

Updated by coolo over 7 years ago

So now going to insane low values

Actions

Copy link

#20

Updated by coolo over 7 years ago

Assignee changed from kraih to coolo

Actions

Copy link

#21

Updated by coolo over 7 years ago

this is from apache log about the above:

``
[Tue Jun 27 05:00:02.042426 2017] [proxy_http:error] pid 14598Connection reset by peer: [client 2620:113:80c0:8080:10:160:1:100:47638] AH01102: error reading status line from remote server localhost:9526
[Tue Jun 27 05:00:02.042465 2017] [proxy:error] [pid 14598] [client 2620:113:80c0:8080:10:160:1:100:47638] AH00898: Error reading from remote server returned by /api/v1/jobs/1024024/artefact

Actions

Copy link

#22

Updated by coolo over 7 years ago

My latest findings for job 1039448 support my theory it's a mojo bug. I added debug for upload progress and what we see is a working
having a running transaction and restarting due to RSS limit:

17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker 22634 has accepted the request for /api/v1/jobs/1039448/artefact 17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker exceeded RSS limit "219232 > 200000", restarting 17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Stopping worker 22634 17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Worker 22634 stopped

The graceful restart restarted by the RSS limit is supposed to finish running transactions first

Actions

Copy link

#23

Updated by qkzhu over 7 years ago

failed at: sle-12-SP3-Desktop-DVD-x86_64-Build0281-regression-installation@64bit https://openqa.suse.de/tests/1039589

Actions

Copy link

#24

Updated by coolo over 7 years ago

ok, we know what it is. The mojo worker is restarted when it hits certain limits. But this restart is 'graceful', so it first finishes the current requests.

But the timeout for that is by default 20 seconds - and uploading a 8G file can take longer. So if the restart trigger is hit right after accepting the upload, chances are you don't finish in these 20 seconds and we fail.

I will set this limit to 800, but we will need to face workers that are in long timeouts and do not accept requests anymore. So possibly increase their number as well.

Actions

Copy link

#25

Updated by kraih over 7 years ago

To help with the problem of too many workers not accepting connections because they are shutting down, i've added
a new feature to Mojolicious. The web server can now
temporarily spawn a few additional workers as soon as an existing worker begins the graceful shutdown process.
The default is for up to two spare workers to be spawned, but that can be tweaked with the -s flag.

Actions

Copy link

#26

Updated by coolo over 7 years ago

Status changed from In Progress to Resolved

https://github.com/os-autoinst/openQA/pull/1385 is hotpatched on osd

Actions

Copy link

#27

Updated by okurz over 7 years ago

Status changed from Resolved to In Progress

https://openqa.suse.de/tests/1076847/file/autoinst-log.txt looks exactly like that problem. Could it be a different problem even though the symptoms show up to be the same? How can we be sure? Some worker logs to check here? Or the logs from the web ui host at the same time?

Actions

Copy link

#28

Updated by okurz over 7 years ago

Also https://openqa.suse.de/tests/1077066 and https://openqa.suse.de/tests/1077199 failed, reproducible problem, different workers.

Actions

Copy link

#29

Updated by coolo over 7 years ago

It's a different problem though. The upload is done, but the rename fails.

coolo@openqa:/var/lib/openqa/factory/hdd> ls -l SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2*
-rw-r--r-- 1 geekotest nogroup 916783104  8. Jul 17:16 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2
-rw-r--r-- 1 geekotest nogroup 921108480 26. Jul 09:59 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-0YQHK5WS
-rw-r--r-- 1 geekotest nogroup 921567232 26. Jul 11:11 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tKieeJTx
-rw-r--r-- 1 geekotest nogroup 914882560 26. Jul 11:20 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ

[Wed Jul 26 11:20:31 2017] [4779:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "create_artefact"
[Wed Jul 26 11:20:31 2017] [4779:debug] moved to /var/lib/openqa/share/factory/hdd/SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ
[Wed Jul 26 11:20:31 2017] [4779:error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "jobs_
assets_job_id_asset_id"
DETAIL:  Key (job_id, asset_id)=(1077199, 58485) already exists. [for Statement "INSERT INTO jobs_assets ( asset_id, created_by, job_id, t_created, t_updated) VALUES ( ?, ?, ?, ?, ? )" wi
th ParamValues: 1='58485', 2='1', 3='1077199', 4='2017-07-26 09:20:31', 5='2017-07-26 09:20:31'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1084

[Wed Jul 26 11:20:31 2017] [4779:debug] 500 Internal Server Error (0.026118s, 38.288/s)

The underlying problem is still that the upload failures aren't reported properly

Actions

Copy link

#30

Updated by coolo over 7 years ago

what kind of test is this with HDD_1 and PUBLISH_HDD_1 being the same value?

Actions

Copy link

#31

Updated by coolo over 7 years ago

Status changed from In Progress to Resolved

this is something else and your reopen is invalid

Actions

Copy link

#32

Updated by okurz over 7 years ago

Copied to action #20786: [tools] image upload problems are not reported properly added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #16430

[tools] qcow2 image upload fails with no apparent reason and job ends up incomplete

observation¶

steps to reproduce¶

problem¶

suggestion¶

workaround¶

Updated by okurz about 8 years ago

Updated by szarate about 8 years ago

Updated by szarate about 8 years ago

Updated by coolo almost 8 years ago

Updated by michalnowak almost 8 years ago

Updated by okurz over 7 years ago

Updated by vsvecova over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by vsvecova over 7 years ago

Updated by okurz over 7 years ago

Updated by dzedro over 7 years ago

Updated by riafarov over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by qkzhu over 7 years ago

Updated by coolo over 7 years ago

Updated by kraih over 7 years ago

Updated by coolo over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by coolo over 7 years ago

Updated by okurz over 7 years ago