Project

General

Profile

action #73285

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error

Added by okurz 3 months ago. Updated 3 months ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-07-30
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/4811018/file/autoinst-log.txt

shows just

[2020-10-12T17:00:42.0701 CEST] [info] [pid:36912] Downloading SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2, request #42551 sent to Cache Service
[2020-10-12T17:00:47.0773 CEST] [info] [pid:36912] Download of SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2 processed
[2020-10-12T17:00:47.0778 CEST] [error] [pid:36912] Failed to download SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2
[2020-10-12T17:00:47.0826 CEST] [info] [pid:36912] +++ worker notes +++
[2020-10-12T17:00:47.0826 CEST] [info] [pid:36912] End time: 2020-10-12 15:00:47
[2020-10-12T17:00:47.0826 CEST] [info] [pid:36912] Result: setup failure

and the worker log https://openqa.suse.de/tests/4811018/file/worker-log.txt

[2020-10-12T17:00:47.0778 CEST] [error] [pid:36912] Unable to setup job 4811018: Failed to download SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2
[2020-10-12T17:00:47.0778 CEST] [debug] [pid:36912] Stopping job 4811018 from openqa.suse.de: 04811018-sle-12-SP5-Server-DVD-Incidents-x86_64-Build:16741:php7-mau-extratests-kdump@64bit - reason: setup failure
[2020-10-12T17:00:47.0778 CEST] [debug] [pid:36912] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4811018/status
[2020-10-12T17:00:47.0859 CEST] [info] [pid:25653] Uploading autoinst-log.txt
[2020-10-12T17:00:47.0932 CEST] [info] [pid:25653] Uploading worker-log.txt

there is no clue on why "SLES-12-SP5-x86_64-mru-install-minimal-with-addons-Build:16741:php7-Server-DVD-Incidents-64bit.qcow2" could not be downloaded, from where to where it should have been downloaded, etc. There is no http status code or anything, like no "404 Not Found" or similar.

Acceptance criteria

  • AC1: The real reason for the download error is accessible in the job details (either "reason" or autoinst-log.txt), e.g. http status code with message

Suggestions

  • Research what the actual error could be and add it as a "likely explanation" in the existing error string
  • Try to include the actual error details into the affected jobs logs
  • If possible, adapt the "reason" to be more specific with details about the actual error

Related issues

Related to openQA Project - action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloadedResolved2020-07-30

Copied from openQA Project - action #69448: test incompletes with auto_review:"(?s)was downloaded by.*details.*unavailable here.*Failed to download":retry , not helpful detailsWorkable2020-07-30

History

#1 Updated by okurz 3 months ago

  • Copied from action #69448: test incompletes with auto_review:"(?s)was downloaded by.*details.*unavailable here.*Failed to download":retry , not helpful details added

#2 Updated by okurz 3 months ago

  • Subject changed from test incompletes with auto_review:"(?s)Download of.*processed$.*Failed to download":retry , not helpful details about reason of error to test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error

#3 Updated by okurz 3 months ago

  • Parent task set to #62420

#4 Updated by Xiaojing_liu 3 months ago

According to the log message, there is no output and error message. We could see that from
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/Engines/isotovideo.pm#L156

    my $msg = "Download of $asset_uri processed";
        if (my $output = $status->output) { $msg .= ":\n$output" }
        log_info($msg, channels => 'autoinst');

        $asset = $cache_client->asset_path($webui_host, $asset_uri)
          if $cache_client->asset_exists($webui_host, $asset_uri);
        ... ...
        if (!$asset) {
            $error = "Failed to download $asset_uri to " . $cache_client->asset_path($webui_host, $asset_uri);
            ... ...
        }

Maybe we could add the $status->result in this log message to see if the result is empty too.
Since the code has set the result:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/CacheService/Controller/API.pm#L38

if ($info->{state} eq 'finished' && !$self->progress->is_downloading($info->{notes}{lock})) {
        $status = {status => 'processed', result => $info->{result}, output => $info->{notes}{output}};

I have not found a way to reproduce this issue.

#5 Updated by mkittler 3 months ago

Apparently there's not error to be reported from the view of the cache service but the asset isn't there after all. Without further information it looks like the asset might have been purged in the meantime. That's what the check @okurz added is supposed to catch. However, I've already noted back then that it will not work in all cases.

#6 Updated by okurz 3 months ago

  • Related to action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded added

#7 Updated by okurz 3 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

mkittler wrote:

Apparently there's not error to be reported from the view of the cache service but the asset isn't there after all.

Do you mean that the specific example from the specific example above is not there right now? If this is the case it could certainly have been there at the time of ticket reporting. Or did you mean something different?

Without further information it looks like the asset might have been purged in the meantime.

which we probably would see in logs

That's what the check @okurz added is supposed to catch. However, I've already noted back then that it will not work in all cases.

Right. It could be that the cache service knows that the asset was pruned but that this was not reported in the job in question here.

Well, as it could be that this is the same root problem as in #71827 we can also wait for that one first

#8 Updated by mkittler 3 months ago

Do you mean that the specific example from the specific example above is not there right now?

Yes, I mean the example from the ticket description. In this example the log message Failed to download… is present. That means the cache service did not report an error because otherwise this error would have been logged and the line generic Failed to download… line not being logged. The generic line is only logged if the cache service reported no error but despite that the file isn't there.

which we probably would see in logs

That's what I mean with "without further information". I haven't looked into the cache service log yet. I suppose the worker log (which is the only log mentioned in the ticket) would not necessarily mention if the asset was purged. (It could have been purged to make room for a completely different asset which has been requested concurrently.)

Also available in: Atom PDF