action #55529
closed
job incompletes when it cannot reach the openQA webUI host even a single time, aka. retry on 521 connect timeout in cache
Added by okurz over 5 years ago.
Updated almost 5 years ago.
Category:
Feature requests
Description
Observation
https://openqa.opensuse.org/tests/1007199/file/autoinst-log.txt shows
[2019-08-14T16:26:13.0435 CEST] [debug] Download of opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 processed: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 322103849472, Configured limit: 322122547200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 322103849472
[INFO] Downloading opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 from http://openqa1-opensuse/tests/1007199/asset/hdd/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
[DEBUG] CACHE: Download of /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 failed with: 521 - Connect timeout
[DEBUG] CACHE: removed /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
Maybe because the worker host had just restarted beforehand?
Suggestion
The cache service could retry the download on "521".
- Related to action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) added
- Related to action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) added
Having worked on the cache service for the past two months, I think this is either a sporadic routing issue in the network, or the webUI's Apache being overloaded at the time. The 521 is not a real HTTP status code; it is just a placeholder the cache service shows when the TCP connection could not be established at all. The timeout for establishing the connection is 10 seconds.
But agreed on the retry logic, it should probably cover this case too. I can take a look, maybe it's not that hard.
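To illustrate the placeholder idea described above, here is a minimal Python sketch. It is not the cache service's actual code (which is part of openQA and written in Perl). The function name `status_of`, the constant names, and the injectable `opener` parameter are hypothetical and exist only for this example. Also note that the `timeout` argument of `urllib` applies to every blocking socket operation, not only to establishing the connection, so it only approximates the 10-second connect timeout mentioned above.

```python
import socket
import urllib.error
import urllib.request

# Not a real HTTP status: stands in for "no TCP connection could be established".
PLACEHOLDER_CONNECT_ERROR = 521
# Seconds; taken from the comment above (assumed to map onto urllib's timeout).
CONNECT_TIMEOUT = 10


def status_of(url, opener=urllib.request.urlopen, timeout=CONNECT_TIMEOUT):
    """Return the HTTP status of a request, or the 521 placeholder
    when the server could not be reached at all."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, so its real status code (e.g. 404) is known.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return err.code
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # No HTTP response at all: refused, timed out, unroutable, ...
        return PLACEHOLDER_CONNECT_ERROR
```

With this split, a caller can tell "the server said no" apart from "the server was never reached", which is the distinction a retry decision would need.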
- Status changed from New to Workable
- Status changed from Workable to In Progress
Merged and deployed on o3. But do we now also retry on 404? https://openqa.opensuse.org/tests/1146615 shows
[2020-01-17T11:12:23.0562 CET] [info] Download of GNOME_Next.x86_64-3.34.3-Build13.98.iso processed:
[info] [#194488] Cache size of "/var/lib/openqa/cache" is 300GiB, with limit 300GiB
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (2 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (1 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Purging "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" because of too many download errors
Two observations based on questions I received from fvogt and guillaume_g on IRC:
- There should be no retry on 404
- Reporting error 521 is confusing when at the low level we already know it is a 404
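The two observations can be sketched as a retry loop that gives up immediately on 404 and always logs the status it actually received. This is a hedged Python illustration, not the fix that was merged. `download_with_retry`, `is_retryable`, and the `fetch` callable are hypothetical names. The defaults of 5 attempts and a 5-second wait mirror the log above. Treating every 5xx status (including the 521 placeholder) as retryable is an assumption.

```python
import time


def is_retryable(status):
    """Assumption: retry only when the server was unreachable (521 placeholder)
    or reported a server-side error (5xx); never on 4xx such as 404."""
    return status >= 500


def download_with_retry(fetch, attempts=5, wait=5, sleep=time.sleep, log=print):
    """Call fetch() until it succeeds, fails permanently, or attempts run out.

    fetch is a hypothetical callable that returns an HTTP status code
    (or the 521 placeholder when no connection could be established).
    Returns the last status seen.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for remaining in range(attempts - 1, -1, -1):
        status = fetch()
        if 200 <= status < 300:
            return status
        if not is_retryable(status) or remaining == 0:
            # Log the real status instead of a generic placeholder.
            log(f"Download failed: {status}, giving up")
            return status
        log(f"Download error {status}, waiting {wait} seconds "
            f"for next try ({remaining} remaining)")
        sleep(wait)
```

With such a loop, the 404 case from the log above would stop after the first attempt and report 404, while a transient 521 would still be retried up to four more times.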
- Has duplicate action #60167: "Download .* failed with: 521 - Connect timeout" jobs incomplete trying to download from cache, potentially worker specific added
- Related to action #62459: Retry on download errors within GRU download tasks added
- Status changed from In Progress to Resolved
- Target version set to Done