Project

General

Profile

action #16430

[tools] qcow2 image upload fails with no apparent reason and job ends up incomplete

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
-
Start date:
2017-02-02
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

observation

https://openqa.opensuse.org/tests/346076/file/autoinst-log.txt
shows

21:52:08.5284 2867 running nice ionice qemu-img convert -O qcow2 raid/l1 assets_public/opensuse-42.3-x86_64-0063-gnome@64bit.qcow2 -c
p11-kit: couldn't list directory: /etc/pkcs11/modules: Permission denied
Failed to open module: /usr/lib64/qemu/block-iscsi.so: failed to map segment from shared object
Failed to open module: /usr/lib64/qemu/block-ssh.so: failed to map segment from shared object
2867: EXIT 0
+++ worker notes +++
end time: 2017-02-02 21:56:48
result: done
uploading consoletest_setup-loadavg_consoletest_setup.txt
uploading consoletest_setup-psaxf.log
uploading install_and_reboot-y2logs.tar.bz2
uploading autoinst.xml
Checksum :1888125274 Sizes::58699
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

and then nothing.

steps to reproduce

A lot of job with same symptoms, e.g. take a look for all incompletes on https://openqa.opensuse.org/tests?hoursfresh=24&match=gnome-image but apparently only on openqaworker1

problem

H1. openqaworker1 has problems uploading the images
H2. The other error messages in the log about "permission denied" and such are related to the problem

suggestion

proper log messages about this.

workaround

maybe retrigger jobs until any other worker than openqaworker1 is hit?


Related issues

Related to openQA Project - action #17218: Can't publish HDD1: cksum: /var/lib/openqa/share/factory/hdd/sle-12-SP3-x86_64-0248-textmode@64bit.qcow2.TEMP-IMZWlhMn: Permission deniedResolved2017-02-20

Related to openQA Project - action #19546: QCOW uploaded as asset won't appear on NFS pool and Assets tabResolved2017-06-02

Has duplicate openQA Tests - action #19670: [tools][sle][functional][aarch64] test failed to upload disk imageRejected2017-06-08

Has duplicate openQA Tests - action #19986: [sled] checksum failed when publishing cryptlvm hdd imagesRejected2017-06-22

Copied to openQA Project - action #20786: [tools] image upload problems are not reported properlyClosed2017-07-26

History

#2 Updated by szarate over 4 years ago

  • Related to action #17218: Can't publish HDD1: cksum: /var/lib/openqa/share/factory/hdd/sle-12-SP3-x86_64-0248-textmode@64bit.qcow2.TEMP-IMZWlhMn: Permission denied added

#3 Updated by szarate over 4 years ago

poo#17218 might be related here

#4 Updated by coolo over 4 years ago

  • Status changed from New to Rejected

17218 is not related - and as this is not an often seen problem I close it as random issue in the infrastructure

#5 Updated by michalnowak over 4 years ago

  • Related to action #19546: QCOW uploaded as asset won't appear on NFS pool and Assets tab added

#6 Updated by okurz over 4 years ago

  • Status changed from Rejected to In Progress
  • Priority changed from High to Low
  • Target version set to future

https://openqa.suse.de/tests/984049/file/autoinst-log.txt
https://openqa.suse.de/tests/983968/file/autoinst-log.txt

two times in the same build, different arch, both times same test suite "cryptlvm". maybe the disk image is just very big in these cases:

uploading sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2
Checksum comparison (actual:expected) 4294967295:4149226778 with size (actual:expected) 0:5163515904
Checksum/size comparison of sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2 FAILED
uploading sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2
Checksum comparison (actual:expected) 4294967295:194332624 with size (actual:expected) 0:10527047680
Checksum/size comparison of sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2 FAILED

#7 Updated by vsvecova over 4 years ago

Another example:
https://openqa.suse.de/tests/987817/file/autoinst-log.txt

uploading SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:1762304692 with size (actual:expected) 0:3093506560
Checksum/size comparison of SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2 FAILED

Restarting does not help, yields the same result:
https://openqa.suse.de/tests/987993

Yet another:
https://openqa.suse.de/tests/988271/file/autoinst-log.txt

uploading SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:4195181595 with size (actual:expected) 0:3076052992
Checksum/size comparison of SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2 FAILED

#8 Updated by okurz over 4 years ago

  • Subject changed from qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete
  • Priority changed from Low to Normal
  • Target version deleted (future)

ok, so it's a thing. How about we further improve our logging? Someone has suggestions? E.g. when size is zero, what should we output? Or check logfiles on the server?

#9 Updated by okurz over 4 years ago

  • Has duplicate action #19670: [tools][sle][functional][aarch64] test failed to upload disk image added

#10 Updated by okurz over 4 years ago

Happened in https://openqa.suse.de/tests/987904 as well (see #19670)

#11 Updated by vsvecova over 4 years ago

Today's example:

https://openqa.suse.de/tests/1011148

uploading SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2
Checksum comparison (actual:expected) 4294967295:3593870896 with size (actual:expected) 0:3106909696
Checksum/size comparison of SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2 FAILED

#12 Updated by okurz over 4 years ago

  • Has duplicate action #19986: [sled] checksum failed when publishing cryptlvm hdd images added

#13 Updated by dzedro over 4 years ago

#14 Updated by riafarov over 4 years ago

#15 Updated by coolo over 4 years ago

  • Subject changed from [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools] qcow2 image upload fails with no apparent reason and job ends up incomplete

the last examples (also as of today) are all about rewriting existant files. For some reason all the cryptlvm hdds do not have the build number in the filename. The examples in #7 have though

#16 Updated by coolo over 4 years ago

  • Assignee set to kraih

Debugging https://openqa.suse.de/tests/1024024 ...

From autoinst log:

uploading textinfo-logs.tar.bz2
uploading sle-12-SP3-Desktop-DVD-x86_64-gnome-encrypted.qcow2
Checksum comparison (actual:expected) 4294967295:4170063198 with size (actual:expected) 0:7906116096
Checksum/size comparison of sle-12-SP3-Desktop-DVD-x86_64-gnome-encrypted.qcow2 FAILED
uploading video.ogv

The upload happened around 05:09 - last successful log:

[Tue Jun 27 05:09:49 2017] [17003:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit/ulogs textinfo-logs.tar.bz2
[Tue Jun 27 05:09:49 2017] [17003:debug] 200 OK (0.009422s, 106.135/s)

Next successful log:

[Tue Jun 27 05:11:37 2017] [17469:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit video.ogv
[Tue Jun 27 05:11:37 2017] [17469:debug] 200 OK (0.01602s, 62.422/s)

In these 2 minutes we saw 1248 POSTs - so it was a quite busy time. But what is worrying me most: we saw 112 worker restarts due to rss limit. So I'm afraid these restarts happen too early - or too late.

This might also be related to the random problems we see on cavil. For reference: this is the restart code: https://github.com/os-autoinst/openQA/pull/1339

#17 Updated by coolo over 4 years ago

I raised the rss limit to 580.000 and restarted the job. Let's see

#18 Updated by coolo over 4 years ago

the upload worked now

#19 Updated by coolo over 4 years ago

So now going to insane low values

#20 Updated by coolo over 4 years ago

  • Assignee changed from kraih to coolo

#21 Updated by coolo over 4 years ago

this is from apache log about the above:

``
[Tue Jun 27 05:00:02.042426 2017] [proxy_http:error] pid 14598Connection reset by peer: [client 2620:113:80c0:8080:10:160:1:100:47638] AH01102: error reading status line from remote server localhost:9526
[Tue Jun 27 05:00:02.042465 2017] [proxy:error] [pid 14598] [client 2620:113:80c0:8080:10:160:1:100:47638] AH00898: Error reading from remote server returned by /api/v1/jobs/1024024/artefact

#22 Updated by coolo about 4 years ago

My latest findings for job 1039448 support my theory it's a mojo bug. I added debug for upload progress and what we see is a working
having a running transaction and restarting due to RSS limit:


17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker 22634 has accepted the request for /api/v1/jobs/1039448/artefact
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker exceeded RSS limit "219232 > 200000", restarting
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Stopping worker 22634
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Worker 22634 stopped

The graceful restart restarted by the RSS limit is supposed to finish running transactions first

#23 Updated by qkzhu about 4 years ago

failed at: sle-12-SP3-Desktop-DVD-x86_64-Build0281-regression-installation@64bit https://openqa.suse.de/tests/1039589

#24 Updated by coolo about 4 years ago

ok, we know what it is. The mojo worker is restarted when it hits certain limits. But this restart is 'graceful', so it first finishes the current requests.

But the timeout for that is by default 20 seconds - and uploading a 8G file can take longer. So if the restart trigger is hit right after accepting the upload, chances are you don't finish in these 20 seconds and we fail.

I will set this limit to 800, but we will need to face workers that are in long timeouts and do not accept requests anymore. So possibly increase their number as well.

#25 Updated by kraih about 4 years ago

To help with the problem of too many workers not accepting connections because they are shutting down, i've added
a new feature to Mojolicious. The web server can now
temporarily spawn a few additional workers as soon as an existing worker begins the graceful shutdown process.
The default is for up to two spare workers to be spawned, but that can be tweaked with the -s flag.

#26 Updated by coolo about 4 years ago

  • Status changed from In Progress to Resolved

#27 Updated by okurz about 4 years ago

  • Status changed from Resolved to In Progress

https://openqa.suse.de/tests/1076847/file/autoinst-log.txt looks exactly like that problem. Could it be a different problem even though the symptoms show up to be the same? How can we be sure? Some worker logs to check here? Or the logs from the web ui host at the same time?

#28 Updated by okurz about 4 years ago

Also https://openqa.suse.de/tests/1077066 and https://openqa.suse.de/tests/1077199 failed, reproducible problem, different workers.

#29 Updated by coolo about 4 years ago

It's a different problem though. The upload is done, but the rename fails.

coolo@openqa:/var/lib/openqa/factory/hdd> ls -l SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2*
-rw-r--r-- 1 geekotest nogroup 916783104  8. Jul 17:16 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2
-rw-r--r-- 1 geekotest nogroup 921108480 26. Jul 09:59 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-0YQHK5WS
-rw-r--r-- 1 geekotest nogroup 921567232 26. Jul 11:11 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tKieeJTx
-rw-r--r-- 1 geekotest nogroup 914882560 26. Jul 11:20 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ

[Wed Jul 26 11:20:31 2017] [4779:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "create_artefact"
[Wed Jul 26 11:20:31 2017] [4779:debug] moved to /var/lib/openqa/share/factory/hdd/SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ
[Wed Jul 26 11:20:31 2017] [4779:error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "jobs_
assets_job_id_asset_id"
DETAIL:  Key (job_id, asset_id)=(1077199, 58485) already exists. [for Statement "INSERT INTO jobs_assets ( asset_id, created_by, job_id, t_created, t_updated) VALUES ( ?, ?, ?, ?, ? )" wi
th ParamValues: 1='58485', 2='1', 3='1077199', 4='2017-07-26 09:20:31', 5='2017-07-26 09:20:31'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1084

[Wed Jul 26 11:20:31 2017] [4779:debug] 500 Internal Server Error (0.026118s, 38.288/s)

The underlying problem is still that the upload failures aren't reported properly

#30 Updated by coolo about 4 years ago

what kind of test is this with HDD_1 and PUBLISH_HDD_1 being the same value?

#31 Updated by coolo about 4 years ago

  • Status changed from In Progress to Resolved

this is something else and your reopen is invalid

#32 Updated by okurz about 4 years ago

  • Copied to action #20786: [tools] image upload problems are not reported properly added

Also available in: Atom PDF