[tools] qcow2 image upload fails with no apparent reason and job ends up incomplete
21:52:08.5284 2867 running nice ionice qemu-img convert -O qcow2 raid/l1 email@example.com -c p11-kit: couldn't list directory: /etc/pkcs11/modules: Permission denied Failed to open module: /usr/lib64/qemu/block-iscsi.so: failed to map segment from shared object Failed to open module: /usr/lib64/qemu/block-ssh.so: failed to map segment from shared object 2867: EXIT 0 +++ worker notes +++ end time: 2017-02-02 21:56:48 result: done uploading consoletest_setup-loadavg_consoletest_setup.txt uploading consoletest_setup-psaxf.log uploading install_and_reboot-y2logs.tar.bz2 uploading autoinst.xml Checksum :1888125274 Sizes::58699 uploading video.ogv uploading vars.json uploading serial0.txt uploading autoinst-log.txt
and then nothing.
steps to reproduce¶
A lot of job with same symptoms, e.g. take a look for all incompletes on https://openqa.opensuse.org/tests?hoursfresh=24&match=gnome-image but apparently only on openqaworker1
H1. openqaworker1 has problems uploading the images
H2. The other error messages in the log about "permission denied" and such are related to the problem
proper log messages about this.
maybe retrigger jobs until any other worker than openqaworker1 is hit?
#6 Updated by okurz over 4 years ago
- Status changed from Rejected to In Progress
- Priority changed from High to Low
- Target version set to future
two times in the same build, different arch, both times same test suite "cryptlvm". maybe the disk image is just very big in these cases:
uploading sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2 Checksum comparison (actual:expected) 4294967295:4149226778 with size (actual:expected) 0:5163515904 Checksum/size comparison of sle-12-SP3-Server-DVD-aarch64-gnome-encrypted.qcow2 FAILED
uploading sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2 Checksum comparison (actual:expected) 4294967295:194332624 with size (actual:expected) 0:10527047680 Checksum/size comparison of sle-12-SP3-Server-DVD-ppc64le-gnome-encrypted.qcow2 FAILED
#7 Updated by vsvecova over 4 years ago
uploading SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2 Checksum comparison (actual:expected) 4294967295:1762304692 with size (actual:expected) 0:3093506560 Checksum/size comparison of SLED-12-SP2-20170607-2-x86_64_for_regression.qcow2 FAILED
Restarting does not help, yields the same result:
uploading SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2 Checksum comparison (actual:expected) 4294967295:4195181595 with size (actual:expected) 0:3076052992 Checksum/size comparison of SLED-12-SP2-20170608-1-x86_64_for_regression.qcow2 FAILED
#8 Updated by okurz over 4 years ago
- Subject changed from qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete
- Priority changed from Low to Normal
- Target version deleted (
ok, so it's a thing. How about we further improve our logging? Someone has suggestions? E.g. when size is zero, what should we output? Or check logfiles on the server?
#11 Updated by vsvecova over 4 years ago
uploading SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2 Checksum comparison (actual:expected) 4294967295:3593870896 with size (actual:expected) 0:3106909696 Checksum/size comparison of SLED-12-SP2-20170620-6-x86_64_for_regression.qcow2 FAILED
#15 Updated by coolo over 4 years ago
- Subject changed from [tools]qcow2 image upload fails with no apparent reason and job ends up incomplete to [tools] qcow2 image upload fails with no apparent reason and job ends up incomplete
the last examples (also as of today) are all about rewriting existant files. For some reason all the cryptlvm hdds do not have the build number in the filename. The examples in #7 have though
#16 Updated by coolo over 4 years ago
- Assignee set to kraih
Debugging https://openqa.suse.de/tests/1024024 ...
From autoinst log:
Checksum comparison (actual:expected) 4294967295:4170063198 with size (actual:expected) 0:7906116096
Checksum/size comparison of sle-12-SP3-Desktop-DVD-x86_64-gnome-encrypted.qcow2 FAILED
The upload happened around 05:09 - last successful log:
[Tue Jun 27 05:09:49 2017] [17003:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit/ulogs textinfo-logs.tar.bz2
[Tue Jun 27 05:09:49 2017] [17003:debug] 200 OK (0.009422s, 106.135/s)
Next successful log:
[Tue Jun 27 05:11:37 2017] [17469:debug] moved to /var/lib/openqa/testresults/01024/01024024-sle-12-SP3-Desktop-DVD-x86_64-Build0274-cryptlvm@64bit video.ogv
[Tue Jun 27 05:11:37 2017] [17469:debug] 200 OK (0.01602s, 62.422/s)
In these 2 minutes we saw 1248 POSTs - so it was a quite busy time. But what is worrying me most: we saw 112 worker restarts due to rss limit. So I'm afraid these restarts happen too early - or too late.
This might also be related to the random problems we see on cavil. For reference: this is the restart code: https://github.com/os-autoinst/openQA/pull/1339
#21 Updated by coolo over 4 years ago
this is from apache log about the above:
[Tue Jun 27 05:00:02.042426 2017] [proxy_http:error] pid 14598Connection reset by peer: [client 2620:113:80c0:8080:10:160:1:100:47638] AH01102: error reading status line from remote server localhost:9526
[Tue Jun 27 05:00:02.042465 2017] [proxy:error] [pid 14598] [client 2620:113:80c0:8080:10:160:1:100:47638] AH00898: Error reading from remote server returned by /api/v1/jobs/1024024/artefact
#22 Updated by coolo over 4 years ago
My latest findings for job 1039448 support my theory it's a mojo bug. I added debug for upload progress and what we see is a working
having a running transaction and restarting due to RSS limit:
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker 22634 has accepted the request for /api/v1/jobs/1039448/artefact
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [22634:debug] Worker exceeded RSS limit "219232 > 200000", restarting
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Stopping worker 22634
17:25 <coolo> [Mon Jul 3 16:59:28 2017] [24509:debug] Worker 22634 stopped
The graceful restart restarted by the RSS limit is supposed to finish running transactions first
#24 Updated by coolo over 4 years ago
ok, we know what it is. The mojo worker is restarted when it hits certain limits. But this restart is 'graceful', so it first finishes the current requests.
But the timeout for that is by default 20 seconds - and uploading a 8G file can take longer. So if the restart trigger is hit right after accepting the upload, chances are you don't finish in these 20 seconds and we fail.
I will set this limit to 800, but we will need to face workers that are in long timeouts and do not accept requests anymore. So possibly increase their number as well.
#25 Updated by kraih over 4 years ago
To help with the problem of too many workers not accepting connections because they are shutting down, i've added
a new feature to Mojolicious. The web server can now
temporarily spawn a few additional workers as soon as an existing worker begins the graceful shutdown process.
The default is for up to two spare workers to be spawned, but that can be tweaked with the
#27 Updated by okurz over 4 years ago
- Status changed from Resolved to In Progress
https://openqa.suse.de/tests/1076847/file/autoinst-log.txt looks exactly like that problem. Could it be a different problem even though the symptoms show up to be the same? How can we be sure? Some worker logs to check here? Or the logs from the web ui host at the same time?
#29 Updated by coolo over 4 years ago
It's a different problem though. The upload is done, but the rename fails.
coolo@openqa:/var/lib/openqa/factory/hdd> ls -l SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2* -rw-r--r-- 1 geekotest nogroup 916783104 8. Jul 17:16 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2 -rw-r--r-- 1 geekotest nogroup 921108480 26. Jul 09:59 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-0YQHK5WS -rw-r--r-- 1 geekotest nogroup 921567232 26. Jul 11:11 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tKieeJTx -rw-r--r-- 1 geekotest nogroup 914882560 26. Jul 11:20 SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ [Wed Jul 26 11:20:31 2017] [4779:debug] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "create_artefact" [Wed Jul 26 11:20:31 2017] [4779:debug] moved to /var/lib/openqa/share/factory/hdd/SLES-12-SP3-aarch64-0472-minimal_with_sdk0262_installed.qcow2.TEMP-tzKPM_MJ [Wed Jul 26 11:20:31 2017] [4779:error] DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: duplicate key value violates unique constraint "jobs_ assets_job_id_asset_id" DETAIL: Key (job_id, asset_id)=(1077199, 58485) already exists. [for Statement "INSERT INTO jobs_assets ( asset_id, created_by, job_id, t_created, t_updated) VALUES ( ?, ?, ?, ?, ? )" wi th ParamValues: 1='58485', 2='1', 3='1077199', 4='2017-07-26 09:20:31', 5='2017-07-26 09:20:31'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1084 [Wed Jul 26 11:20:31 2017] [4779:debug] 500 Internal Server Error (0.026118s, 38.288/s)
The underlying problem is still that the upload failures aren't reported properly