action #62459

coordination #62456: [epic] test incompletes after failing in GRU download task on "Inactivity timeout" with no logs

Retry on download errors within GRU download tasks

Added by okurz almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-01-21
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario obs-Unstable-Appliance-x86_64-obs_appliance@64bit-4G fails to download its asset in the GRU task with the following error:

Gru job failed
Reason: asset download: download of http://download.opensuse.org/repositories/OBS:/Server:/Unstable/images/obs-server.x86_64-2.10.51-qcow2-Build2.438.qcow2 to /var/lib/openqa/share/factory/hdd/obs-server.x86_64-2.10.51-qcow2-Build2.438.qcow2 failed: connection error: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/Task/Asset/Download.pm line 74.

Reproducible

Hard; it seems to be related to temporary network problems.

Acceptance criteria

  • GRU download retries automatically on temporary network problems

Suggestions

Similar to the retry that was added for cache service asset downloads in #55529, we could also retry downloads within GRU download jobs when an error looks like a temporary network issue.
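A minimal sketch of such a retry, written in Python for illustration rather than in the Perl that openQA uses, assuming a download is a callable that raises an exception on failure. The names `TransientDownloadError` and `download_with_retries`, and the defaults of 5 attempts with a 5 second delay, are hypothetical and not the openQA API. Only errors classified as temporary are retried, so a permanent failure (for example a 404) still fails the GRU job right away.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientDownloadError(Exception):
    """Hypothetical error type for failures that look temporary,
    e.g. "connection error: Inactivity timeout"."""


def download_with_retries(
    fetch: Callable[[], T],
    attempts: int = 5,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> T:
    """Call fetch() until it succeeds or the attempts are used up.

    Only TransientDownloadError triggers a retry; any other exception
    propagates immediately, so permanent errors are not retried.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TransientDownloadError as err:
            remaining = attempts - attempt
            if remaining == 0:
                raise  # give up: the job fails with the last error
            log(f"Download error: {err}, waiting {delay:g} seconds "
                f"for next try ({remaining} remaining)")
            sleep(delay)
    raise ValueError("attempts must be at least 1")


# Usage: a fake download that times out twice, then succeeds.
outcomes = iter([TransientDownloadError("Inactivity timeout"),
                 TransientDownloadError("Inactivity timeout"),
                 "ok"])


def flaky_fetch() -> str:
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


print(download_with_retries(flaky_fetch, sleep=lambda s: None))
```

Injecting `sleep` and `log` keeps the retry policy testable without real waiting, which is one way the shared tests for special cases could be written.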

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache (Resolved, 2019-08-14)

Related to openQA Project - action #62159: Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI) (Resolved, 2020-01-15)

History

#1 Updated by okurz almost 2 years ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added

#2 Updated by okurz over 1 year ago

  • Related to action #62159: Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI) added

#3 Updated by kraih over 1 year ago

  • Assignee set to kraih

I'll look into combining the GRU and cache service downloads into a shared module. Both do pretty much the same work, and the cache service already has a reliable retry feature. Sharing tests for all the various special cases will also make future maintenance easier.

#4 Updated by cdywan over 1 year ago

  • Target version set to Current Sprint

#5 Updated by kraih over 1 year ago

  • Status changed from Workable to In Progress

I think I have a workable solution now; I just need to improve test coverage a bit more before opening the PR.

#7 Updated by kraih over 1 year ago

  • Status changed from In Progress to Feedback

PR has been merged and deployed on O3.

#8 Updated by kraih over 1 year ago

I looked through the O3 logs a bit, and the retry seems to be helping with network errors:

[2020-02-21T01:36:01.0900 UTC] [debug] [#132159] Downloading "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" to "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:36:01.0900 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:37:48.0261 UTC] [info] [#132159] Size of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" differs, expected 982MiB but downloaded 325MiB
[2020-02-21T01:37:48.0378 UTC] [info] [#132159] Download error 598, waiting 5 seconds for next try (4 remaining)
[2020-02-21T01:37:53.0379 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:38:58.0748 UTC] [info] [#132159] Size of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" differs, expected 982MiB but downloaded 535MiB
[2020-02-21T01:38:58.0873 UTC] [info] [#132159] Download error 598, waiting 5 seconds for next try (3 remaining)
[2020-02-21T01:39:03.0874 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:41:16.0994 UTC] [debug] [#132159] Download of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" successful
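In the log above, a transfer that stops early is detected by comparing the downloaded size with the expected size, reported as "Download error 598" (598 is not a standard HTTP status), and retried after 5 seconds. Because the first failure reports "(4 remaining)", the log is consistent with 5 attempts in total. The Python sketch below replays that pattern; `check_size`, `download_file`, and mapping a size mismatch to code 598 are modeled on the log text only and are assumptions, not the openQA source.

```python
import time
from typing import Callable, List, Tuple

MIB = 1024 * 1024
SIZE_MISMATCH = 598  # code seen in the O3 log for a truncated download


def check_size(path: str, expected: int, downloaded: int) -> Tuple[int, str]:
    """Return (0, "") if the sizes match, else (598, log-style message)."""
    if downloaded == expected:
        return 0, ""
    return SIZE_MISMATCH, (
        f'Size of "{path}" differs, expected {expected // MIB}MiB '
        f"but downloaded {downloaded // MIB}MiB"
    )


def download_file(
    path: str,
    expected: int,
    transfer: Callable[[], int],
    attempts: int = 5,
    delay: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run transfer() (returns bytes written) until the size matches.

    Returns the messages produced, mimicking the O3 log lines.
    """
    messages: List[str] = []
    for attempt in range(1, attempts + 1):
        code, msg = check_size(path, expected, transfer())
        if code == 0:
            messages.append(f'Download of "{path}" successful')
            return messages
        messages.append(msg)
        remaining = attempts - attempt
        if remaining == 0:
            # Hypothetical final message; the log excerpt does not show this case.
            messages.append(f"Download failed after {attempts} attempts")
            return messages
        messages.append(f"Download error {code}, waiting {delay} seconds "
                        f"for next try ({remaining} remaining)")
        sleep(delay)
    return messages


# Usage: replay the sizes from the log (325 MiB, 535 MiB, then complete).
sizes = iter([325 * MIB, 535 * MIB, 982 * MIB])
iso = "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
for line in download_file(iso, 982 * MIB, lambda: next(sizes), sleep=lambda s: None):
    print(line)
```

Treating a short file as a retryable error, rather than trusting the transfer to report its own failure, is what lets the truncated downloads in the log recover on the third attempt.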

#9 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

It's great to see that you also looked into the logs of the production instance and even found cases where retries were triggered, so I am confident that the feature works as intended and the ACs are fulfilled.
