action #12838

closed

sporadic "corrupt images" in various tests or fails uploading, e.g. with "Premature connection close"

Added by okurz over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version: -
Start date: 2016-06-15
Due date:
% Done: 0%
Estimated time:
Description

observation

many jobs have problems reusing an image created by another openQA job;
e.g. see https://openqa.opensuse.org/tests/228821 trying to use the image created by
https://openqa.opensuse.org/tests/228817
In the creating job the image checksum is computed:

06:06:03.6884 10291 sha1sum fcde63ff2f782dc70933658b5c7abb5f91825480 *assets_public/opensuse-42.2-x86_64-0112-textmode@64bit.qcow2

however, checking the current image yields a different result:

okurz@ariel:~> sha1sum /var/lib/openqa/share/factory/hdd/opensuse-42.2-x86_64-0112-textmode@64bit.qcow2 
090677ecfc2a6713dbdf8fab9ad082e431a0c6aa  /var/lib/openqa/share/factory/hdd/opensuse-42.2-x86_64-0112-textmode@64bit.qcow2

steps to reproduce

look for child jobs of the image-creating ones.

problem

TBD


Related issues: 1 (0 open, 1 closed)

Copied from openQA Project - action #12344: sporadic "corrupt images" in svirt based test on zkvm (Resolved, 2016-06-15)

Actions #1

Updated by okurz over 7 years ago

  • Copied from action #12344: sporadic "corrupt images" in svirt based test on zkvm added
Actions #2

Updated by okurz over 7 years ago

  • Assignee set to coolo
  • Priority changed from Normal to Urgent

the issue that DimStar reported regarding "sysread failed: Connection reset by peer" was this one, right?

osukup reported the same today, see http://argus.suse.cz/tests/223/file/autoinst-log.txt at the end

16:01:12.3714 2413 sha1sum f56dd16162e3055c1af33d32e4513cc9273f896c *assets_public/SLES-12-SP1-x86_64-kernel-BUILD-Pepper1.qcow2
EXIT 0
16:01:12.3717 2413 awaiting death of commands process
16:01:12.3716 2416 sysread failed: Connection reset by peer
16:01:12.3835 2413 commands process exited: 2416
Actions #3

Updated by coolo over 7 years ago

the sysread error is from the commands thread, which is shut down. I don't think this is related at all to corrupt images.

Actions #4

Updated by coolo over 7 years ago

  • Assignee changed from coolo to oholecek

I think what we need to do is verify uploads - much as it hurts performance, we need openQA to verify it got the same CRC as the worker uploaded (a fast checksum is good enough, no need to go ballistic with sha1).
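
For illustration, a minimal sketch of such a fast checksum on the worker side, assuming the Digest::CRC CPAN module (just one option for a cheap CRC, not necessarily what openQA ends up using):

use strict;
use warnings;
use Digest::CRC;

# Compute a fast CRC32 of an asset file without shelling out to an external tool.
sub asset_crc32 {
    my ($path) = @_;
    my $crc = Digest::CRC->new(type => 'crc32');
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    $crc->addfile($fh);
    close($fh);
    return $crc->hexdigest;
}

print asset_crc32('assets_public/opensuse-42.2-x86_64-0112-textmode@64bit.qcow2'), "\n";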

Actions #5

Updated by oholecek over 7 years ago

  • Status changed from New to In Progress

If the checksum doesn't match, then what? Fail the job or try to re-upload? What about checksumming the rest of the assets?
I'm inclined to add checksumming to OpenQA::Worker::Jobs::upload, add a checksum column to the database and an API to query asset info so the worker can check for a valid checksum.

Actions #6

Updated by oholecek over 7 years ago

Of course the checksum in the db is calculated on the webUI side:
1) Worker uploads the file.
2) Worker starts to calculate the local checksum.
3) After the webUI receives the file, it starts to calculate the checksum and stores it in the db.
4) Worker queries the webUI for asset info and compares the checksums.

If the checksums mismatch, fail the job, or invalidate the asset and trigger a re-upload.
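
A rough sketch of the worker side of step 4, assuming a hypothetical asset-info route and checksum field (the route and field names here are placeholders, not the existing openQA API):

use strict;
use warnings;
use Mojo::UserAgent;
use Digest::CRC;

# Hypothetical: ask the webUI for the asset info it stored after the upload
# and compare its checksum with the one the worker computed locally.
sub uploaded_checksum_matches {
    my ($base_url, $asset_name, $local_path) = @_;
    my $ua   = Mojo::UserAgent->new;
    my $info = $ua->get("$base_url/api/v1/assets/$asset_name")->res->json // {};
    my $crc  = Digest::CRC->new(type => 'crc32');
    open(my $fh, '<:raw', $local_path) or die "cannot open $local_path: $!";
    $crc->addfile($fh);
    close($fh);
    return ($info->{checksum} // '') eq $crc->hexdigest;
}

If this returns false, the worker can either fail the job or trigger a re-upload, as discussed above.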

Actions #7

Updated by coolo over 7 years ago

this clearly has the advantage that we can reuse the checksum if we want to replace NFS at some point in the future :)

Actions #8

Updated by okurz over 7 years ago

I would say the calculation should be done by the worker because it's a "number crunching" task. So it's yet another data entry that should be added to some json file and uploaded.
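
A minimal sketch of that idea, assuming a hypothetical checksums.json written next to the published assets (the file name and layout are made up for illustration):

use strict;
use warnings;
use Mojo::JSON qw(encode_json);
use Digest::CRC;

# Hypothetical: collect per-asset checksums into one JSON file that the
# worker uploads alongside the published images.
my %checksums;
for my $path (glob 'assets_public/*.qcow2') {
    my $crc = Digest::CRC->new(type => 'crc32');
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    $crc->addfile($fh);
    close($fh);
    $checksums{$path} = $crc->hexdigest;
}
open(my $out, '>', 'assets_public/checksums.json') or die "cannot write: $!";
print $out encode_json(\%checksums);
close($out);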

Actions #9

Updated by oholecek over 7 years ago

Do you mean something like this?

Job A:
1) worker calculates checksum
2) worker uploads image + expected checksum

Job B:
1) worker calculates checksum of required asset
2) in case of mismatch abort the job

In the end it has the same effect as using the wrong asset, IMO.

Actions #10

Updated by oholecek over 7 years ago

Or wait until the asset is back-propagated through shared storage and check the checksum there?

The load on the webUI is indeed a concern, esp. with a higher worker count. We have about 6 mojo processes thanks to prefork => at most 6 concurrent large image uploads (md5 takes ~45s for a 5G image on an i7-3820@3.6GHz).

Actions #11

Updated by oholecek over 7 years ago

Or let GRU calculate the checksum; in that case, however, I don't see how we can prevent chained jobs from running with a wrong asset.

Actions #12

Updated by coolo over 7 years ago

well, the job is only done when the checksum is calculated - but getting a GRU slot can take a while. I want the job to fail when the upload failed - and not the chained jobs.

So let's redefine this a bit:

  • worker uploads assets_public/FOO.qcow2
  • openQA puts it under a temporary filename in /var/lib/openqa/share and tells the worker the resulting filename
  • worker calculates crc32 (md5 is really the worst choice when speed wins over security) of assets_public/FOO.qcow2 and /var/lib/openqa/FOO-temp.qcow2
  • if they match, the worker tells openQA to rename it into place
  • if they mismatch, the job fails with tons of debug :)
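
As a rough sketch of the comparison step in that flow (the actual rename/fail API calls are left out, since they are whatever the eventual implementation provides):

use strict;
use warnings;
use Digest::CRC;

# Compare the worker's local asset with the temporary copy openQA wrote to
# shared storage; the temp path is whatever the upload response reported back.
sub uploaded_copy_matches {
    my ($local, $temp_on_share) = @_;
    my @sums;
    for my $path ($local, $temp_on_share) {
        my $crc = Digest::CRC->new(type => 'crc32');
        open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
        $crc->addfile($fh);
        close($fh);
        push @sums, $crc->hexdigest;
    }
    return $sums[0] eq $sums[1];
}

# if they match, tell openQA to rename the temp file into place;
# otherwise fail the job with plenty of debug output.
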
Actions #13

Updated by okurz over 7 years ago

Yes, just let the workers calculate the checksum synchronously at the end of the job. It does not take ages, and it only happens for publishing jobs. The worker should do it because it scales better this way: if we have more jobs and/or more workers, it will still work without overloading the webUI host (which should be just the webUI, at best).

Actions #14

Updated by oholecek over 7 years ago

Ok, what about the plan to obsolete NFS at some point?

Right, md5 is slow. cksum takes ~14s, sha1sum ~16s for the same image.

Actions #15

Updated by coolo over 7 years ago

if we obsolete NFS, we will have to re-download the asset to validate it - we basically do the same now

Actions #16

Updated by coolo over 7 years ago

I had the hope that having unique file names would make this less likely, but that's not the case ;(

https://openqa.opensuse.org/tests/237472/file/autoinst-log.txt
says:
sha1sum 1addfc7b73ae49f69dbd6eb3941586a10ffbdd06 *assets_public/opensuse-42.1-x86_64-Updates-20160806-2-cryptlvm@uefi.qcow2
but after upload it's
sha1sum 8576693e5032177c1d26ac9fde21af465fe26363 opensuse-42.1-x86_64-Updates-20160806-2-cryptlvm@uefi.qcow2

Actions #17

Updated by coolo over 7 years ago

https://github.com/os-autoinst/openQA/pull/826 implements what I described in #12

Actions #18

Updated by okurz over 7 years ago

uploading SLES-12-SP2-ppc64le-ha.qcow2
ERROR SLES-12-SP2-ppc64le-ha.qcow2: Connection error: Premature connection close

in https://openqa.suse.de/tests/523391

Actions #20

Updated by okurz over 7 years ago

  • Subject changed from sporadic "corrupt images" in various tests to sporadic "corrupt images" in various tests or fails uploading, e.g. with "Premature connection close"
Actions #21

Updated by mgriessmeier over 7 years ago

latest failing example:
https://openqa.suse.de/tests/528417

Actions #22

Updated by okurz over 7 years ago

  • Has duplicate action #13482: isotovideo process fails to die on job completion, worker becomes stuck added
Actions #23

Updated by okurz over 7 years ago

(wrong ticket referenced)

Actions #24

Updated by okurz over 7 years ago

  • Has duplicate deleted (action #13482: isotovideo process fails to die on job completion, worker becomes stuck)
Actions #25

Updated by AdamWill over 7 years ago

yup, after updating the Fedora staging deployment to current git, I immediately got a couple of jobs with failed disk image uploads:

https://openqa.stg.fedoraproject.org/tests/33982
https://openqa.stg.fedoraproject.org/tests/34004

Note, we are still running on Mojolicious 6. I will try reverting f2547e9bcc0a166f993426bceeacd00179116716, I guess.

Actions #26

Updated by AdamWill over 7 years ago

https://github.com/os-autoinst/openQA/commit/a50e86ac6654bea9ec0fd32b3b132fb162a79c39 was not sufficient to fix this for our staging deployment (again, remember we're still running on Mojo 6), but along with a further patch that completes the reversion of https://github.com/os-autoinst/openQA/commit/f2547e9bcc0a166f993426bceeacd00179116716 (restoring the submission of the request via $OpenQA::Worker::Common::ua->start() and the recursive IOLoop entry bit), our uploads seem to be working properly again.

Actions #27

Updated by AdamWill over 7 years ago

Per http://mojolicious.org/perldoc/Mojolicious/Guides/FAQ#What-does-Inactivity-timeout-mean , 'premature connection close' could be caused by Mojo's default inactivity timeouts. I'm testing to see if this helps:

From 735d487fb0cb61cd8dbe9715cd6838254e9165d4 Mon Sep 17 00:00:00 2001
From: Adam Williamson <awilliam@redhat.com>
Date: Wed, 31 Aug 2016 08:59:15 -0700
Subject: [PATCH] go back to blocking post for upload, extend inactivity
 timeout

---
 lib/OpenQA/Worker/Jobs.pm | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/lib/OpenQA/Worker/Jobs.pm b/lib/OpenQA/Worker/Jobs.pm
index 2364945..0a43045 100644
--- a/lib/OpenQA/Worker/Jobs.pm
+++ b/lib/OpenQA/Worker/Jobs.pm
@@ -138,15 +138,12 @@ sub upload {
     my $ua_url = $OpenQA::Worker::Common::url->clone;
     $ua_url->path("jobs/$job_id/artefact");

-    my $tx = $OpenQA::Worker::Common::ua->build_tx(POST => $ua_url => form => $form);
-    # override the default boundary calculation - it reads whole file
-    # and it can cause various timeouts
-    my $ct = $tx->req->headers->content_type;
-    my $boundary = encode_base64 join('', map chr(rand 256), 1 .. 32);
-    $boundary =~ s/\W/X/g;
-    $tx->req->headers->content_type("$ct; boundary=$boundary");
-
-    my $res = $OpenQA::Worker::Common::ua->start($tx);
+    # Uploading multi-GB files takes time: extend the inactivity timeout
+    my $origto = $OpenQA::Worker::Common::ua->inactivity_timeout;
+    $OpenQA::Worker::Common::ua->inactivity_timeout(300);
+    my $res = $OpenQA::Worker::Common::ua->post($ua_url => form => $form);
+    # reset inactivity timeout
+    $OpenQA::Worker::Common::ua->inactivity_timeout($origto);

     if (my $err = $res->error) {
         my $msg;
-- 
2.9.3
Actions #28

Updated by AdamWill over 7 years ago

note you're actually getting a lot of "Premature connection close" incompletes on openqa.opensuse.org, but a lot of them get auto-duplicated so maybe it's not totally obvious. But you can query for it:

https://openqa.opensuse.org/api/v1/jobs?result=incomplete&maxage=36000

That is just the last 10 hours. A lot of them seem to be hitting 'Premature connection close' on fairly innocuous-looking file uploads... they often seem to fail on consoletest_setup-loadavg_consoletest_setup.txt, which is like 20 bytes.
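
For example, a quick one-off script to pull that list and see how many incompletes there are (assuming the response contains a jobs array with id and name fields, as the jobs API returns):

use strict;
use warnings;
use Mojo::UserAgent;

# Fetch incomplete jobs from the last 10 hours and print their ids and names.
my $ua   = Mojo::UserAgent->new;
my $url  = 'https://openqa.opensuse.org/api/v1/jobs?result=incomplete&maxage=36000';
my $res  = $ua->get($url)->res->json // {};
my $jobs = $res->{jobs} // [];
printf "%d incomplete jobs in the last 10 hours\n", scalar @$jobs;
printf "%s  %s\n", $_->{id}, $_->{name} for @$jobs;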

Actions #30

Updated by coolo over 7 years ago

I don't think any of the fixes are deployed

Actions #31

Updated by coolo over 7 years ago

  • Status changed from In Progress to Resolved

most of this was most likely caused by the mojo bug fixed in https://github.com/kraih/mojo/commit/02049ddabb93e07699edab6ad28f208d4064daf4 - let's close it as resolved
