action #57782

closed

retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download

Added by okurz over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version: -
Start date: 2019-10-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.opensuse.org/tests/1050658 is incomplete with a failed GRU asset download task. The asset URL http://download.opensuse.org/repositories/KDE:/Medias/images/iso/Argon.x86_64-15.1-Build2.106.iso mentioned in the job yields a properly downloadable ISO file. A retriggered job, https://openqa.opensuse.org/tests/1050952, ends up incomplete with no apparent attempt to retry the download.

Wishes

  1. The error feedback in the original job should be more verbose about the underlying problem of why a download failed, e.g. "404 error" or others
  2. The cloned job should retry the download
  3. The cloned job should use a similar means of providing feedback with a detail box in "#Details", not just in autoinst-log.txt
  4. The cloned

Related issues 4 (0 open, 4 closed)

Related to openQA Project - action #57776: "log_fatal" should mention calling method, not the log message handler itself (Resolved, tinita, 2019-10-08)

Related to openQA Project - action #46742: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache (Resolved, okurz, 2019-01-28, 2020-02-18)

Related to openQA Project - action #62159: Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI) (Resolved, Xiaojing_liu, 2020-01-15)

Related to openQA Project - coordination #62420: [epic] Distinguish all types of incompletes (Resolved, okurz, 2018-12-12)

Actions #1

Updated by okurz over 4 years ago

  • Related to action #57776: "log_fatal" should mention calling method, not the log message handler itself added
Actions #2

Updated by coolo over 4 years ago

  • Priority changed from Normal to Low
  • Target version set to Ready

I wouldn't restart the download from retriggering jobs. Everyone who knows the retriggering jobs code will agree :)

But the download job details could be exposed better - including a way to restart it. For now this is low prio though

Actions #3

Updated by mkittler over 4 years ago

  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint
Actions #4

Updated by mkittler over 4 years ago

  • Status changed from New to In Progress

PR for improving the error message and the overall asset download code: https://github.com/os-autoinst/openQA/pull/2581

For re-triggering we would need to change the way the error is displayed and add a route to trigger that single download. But for the user it would be better to enqueue the download again when restarting the job. I'm wondering whether that would really be so terrible from the code perspective.
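To make the "enqueue the download again when restarting the job" idea concrete, here is a minimal sketch in Python. It is not openQA code: `Job`, `restart_job` and the `enqueue_download` callback are hypothetical names, and it assumes a job knows which URL each of its assets came from.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict


@dataclass
class Job:
    """Hypothetical stand-in for a job; not openQA's real job model."""
    job_id: int
    # Maps a local asset file name to the URL it was downloaded from.
    asset_urls: Dict[str, str] = field(default_factory=dict)


def restart_job(job: Job, new_id: int, asset_dir: Path,
                enqueue_download: Callable[[str, Path], None]) -> Job:
    """Clone `job` and enqueue a download for every asset missing on disk."""
    clone = Job(job_id=new_id, asset_urls=dict(job.asset_urls))
    for name, url in clone.asset_urls.items():
        target = asset_dir / name
        if not target.exists():
            # The asset either failed to download earlier or was cleaned up
            # since, so ask the (assumed) task queue to fetch it again.
            enqueue_download(url, target)
    return clone
```

Assets that already exist are left alone, so restarting a job whose download succeeded would not cause any extra traffic. Whether the clone should wait for the enqueued download before being scheduled is left open here.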

Actions #5

Updated by AdamWill over 4 years ago

FWIW I do find it annoying when I want to clone an old job for which the assets have been garbage collected and it fails because a downloadable asset isn't there; it'd be nice to retry the asset download if the asset is not present. But IIRC it'd be quite a pain to implement...

Actions #6

Updated by okurz over 4 years ago

  • Related to action #46742: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache added
Actions #7

Updated by okurz over 4 years ago

Recent story from #opensuse-factory:

<DimStar> https://openqa.opensuse.org/tests/1146615 - download error 521? what is that supposed to mean
<fvogt> Apparently with the new yaml job groups the medium types configuration is not used anymore?
<fvogt> aarch32-HD24G it is then
<guillaume_g> fvogt: no idea. okurz ^ ?
<guillaume_g> fvogt: anyway, aarch32-HD24G is fine and consistent with JeOS aarch64
<okurz> fvogt, guillaume_g: 521 on top level, one line above you should see 404 so I assume the file is simply not there
<okurz> kraih: we also retry on 404 now? https://openqa.opensuse.org/tests/1146615
<fvogt> guillaume_g: Something broke with that machine: https://openqa.opensuse.org/tests/1146627#settings
<DimStar> okurz: ah, I see. so o3 actually failed to download the medium from download.o.o - but still registered the jobs. From the logs:
<DimStar> openqa_gru:[2020-01-17T07:43:02.0085 UTC] [fatal] [pid:6454] asset download: download of http://download.opensuse.org/repositories/GNOME:/Medias/images/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso to /var/lib/openqa/share/factory/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso failed: connection error: Inactivity timeout
<kraih> okurz: looks like it yes, all 4xx codes
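The confusion above (a top-level 521, a 404 one line earlier, an inactivity timeout in the GRU log) suggests the failure message should name the cause directly. Below is a minimal Python sketch of such a classification. It is not the openQA implementation: `describe_failure` is a hypothetical helper, and its retry policy is an assumption for illustration. Unlike the behaviour kraih describes, it does not retry on 4xx codes.

```python
from typing import Optional, Tuple


def describe_failure(url: str, status: Optional[int],
                     error: Optional[str] = None) -> Tuple[str, bool]:
    """Return (message, retry) for a failed download.

    `status` is the HTTP status code, or None when no response arrived.
    The retry decisions below are assumptions, not openQA's actual policy.
    """
    if status is None:
        # No response at all, e.g. "Inactivity timeout": likely transient.
        return f"download of {url} failed: connection error: {error}", True
    if 400 <= status < 500:
        # Client errors such as 404 rarely go away on their own.
        return f"download of {url} failed: {status} client error", False
    if 500 <= status < 600:
        # Server-side errors may be temporary, so a retry is worth a try.
        return f"download of {url} failed: {status} server error", True
    return f"download of {url} failed: unexpected status {status}", False
```

With a message like this in the job's details, a reader could tell a missing file (404) from a timeout without digging through the GRU log.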
Actions #8

Updated by mkittler over 4 years ago

  • Related to action #62159: Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI) added
Actions #9

Updated by okurz over 4 years ago

Actions #10

Updated by okurz about 4 years ago

  • Priority changed from Low to Normal

https://openqa.opensuse.org/tests/1214438 looks as if it is of the same kind, though I'm not 100% sure.

Actions #11

Updated by mkittler about 4 years ago

  • Status changed from In Progress to Resolved
  • Target version deleted (Current Sprint)

@okurz I currently cannot access the job because o3 is very slow.

I'm closing this ticket. There is another ticket which is about the same problem but has a different solution in mind. I implemented the other solution now. Also see https://github.com/os-autoinst/openQA/pull/2676#issuecomment-578792752.
