action #62459
closed · coordination #62456: [epic] test incompletes after failing in GRU download task on "Inactivity timeout" with no logs
Retry on download errors within GRU download tasks
Description
Observation
openQA test in scenario obs-Unstable-Appliance-x86_64-obs_appliance@64bit-4G fails to download in
GRU on
Gru job failed
Reason: asset download: download of http://download.opensuse.org/repositories/OBS:/Server:/Unstable/images/obs-server.x86_64-2.10.51-qcow2-Build2.438.qcow2 to /var/lib/openqa/share/factory/hdd/obs-server.x86_64-2.10.51-qcow2-Build2.438.qcow2 failed: connection error: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/Task/Asset/Download.pm line 74.
Reproducible
Hard to reproduce; it seems to be related to temporary network problems.
Acceptance criteria
- GRU download retries automatically on temporary network problems
Suggestions
Similar to the retry added for cache asset downloads in #55529, we could also retry downloads within GRU download jobs when the failure looks like a temporary network issue.
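The suggested behavior can be sketched as a small retry wrapper. This is a hypothetical Python illustration, not the actual Perl implementation in openQA; the function name, the fixed 5-second delay, and the attempt count are assumptions modeled on the retry log lines later in this ticket:

```python
import time


def with_retries(operation, attempts=5, delay=5, sleep=time.sleep):
    """Run operation(), retrying on network-style errors.

    Hypothetical sketch: retries up to `attempts` times with a fixed
    `delay` between tries, mirroring the "waiting 5 seconds for next
    try (N remaining)" messages seen in the GRU logs.
    """
    for remaining in range(attempts - 1, -1, -1):
        try:
            return operation()
        except OSError as err:  # network errors surface as OSError subclasses
            if remaining == 0:
                raise  # attempts exhausted, propagate the last error
            print(f"Download error: {err}, waiting {delay} seconds "
                  f"for next try ({remaining} remaining)")
            sleep(delay)
```

A caller would pass the single-attempt download as `operation`; injecting `sleep` keeps the wrapper testable without real waits.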
Further details
Always latest result in this scenario: latest
Updated by okurz over 4 years ago
- Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added
Updated by okurz over 4 years ago
- Related to action #62159: Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI) added
Updated by kraih over 4 years ago
- Assignee set to kraih
I'll take a look at combining GRU and cache service downloads into a shared module. Both do pretty much the same work, and the cache service already has a reliable retry feature. Sharing tests for all the various special cases will also make future maintenance easier.
Updated by kraih over 4 years ago
- Status changed from Workable to In Progress
I think I have a workable solution now; I just need to improve test coverage a bit more before opening the PR.
Updated by kraih over 4 years ago
Opened a PR: https://github.com/os-autoinst/openQA/pull/2736
Updated by kraih over 4 years ago
- Status changed from In Progress to Feedback
PR has been merged and deployed on O3.
Updated by kraih over 4 years ago
I looked through the O3 logs a bit, and the retry feature seems to be helping with network errors:
[2020-02-21T01:36:01.0900 UTC] [debug] [#132159] Downloading "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" to "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:36:01.0900 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:37:48.0261 UTC] [info] [#132159] Size of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" differs, expected 982MiB but downloaded 325MiB
[2020-02-21T01:37:48.0378 UTC] [info] [#132159] Download error 598, waiting 5 seconds for next try (4 remaining)
[2020-02-21T01:37:53.0379 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:38:58.0748 UTC] [info] [#132159] Size of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" differs, expected 982MiB but downloaded 535MiB
[2020-02-21T01:38:58.0873 UTC] [info] [#132159] Download error 598, waiting 5 seconds for next try (3 remaining)
[2020-02-21T01:39:03.0874 UTC] [info] [#132159] Downloading "openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" from "http://download.opensuse.org/repositories/KDE:/Medias/images/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso"
[2020-02-21T01:41:16.0994 UTC] [debug] [#132159] Download of "/var/lib/openqa/share/factory/iso/openSUSE_Krypton.x86_64-5.12.80-Build16.9.iso" successful
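The log above shows a truncated download being detected by comparing the downloaded size against the expected size, which then triggers a retry. A minimal sketch of that size check (hypothetical Python; the real code is Perl, and only the message format and MiB units are taken from the log lines):

```python
import os


def verify_size(path, expected_bytes):
    """Return True if the file at `path` has the expected size.

    Hypothetical sketch of the post-download size check: on a mismatch
    (e.g. a connection dropped mid-transfer) it logs the discrepancy in
    MiB, as in the GRU log, and returns False so the caller can retry.
    """
    actual = os.path.getsize(path)
    if actual == expected_bytes:
        return True
    print(f'Size of "{path}" differs, expected '
          f'{expected_bytes // 2**20}MiB but downloaded {actual // 2**20}MiB')
    return False
```

In the logged run, two truncated transfers (325MiB and 535MiB of an expected 982MiB) failed this check before the third attempt completed successfully.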
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
It's great to see that you also looked into the logs of the production instance and even found cases where retries were triggered. I am confident that the feature works as intended and the acceptance criteria are fulfilled.