action #55529
job incompletes when it cannot reach the openQA webui host just a single time, aka retry on 521 connect timeout in cache
Description
Observation
https://openqa.opensuse.org/tests/1007199/file/autoinst-log.txt shows
[2019-08-14T16:26:13.0435 CEST] [debug] Download of opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 processed: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 322103849472, Configured limit: 322122547200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 322103849472
[INFO] Downloading opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 from http://openqa1-opensuse/tests/1007199/asset/hdd/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
[DEBUG] CACHE: Download of /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 failed with: 521 - Connect timeout
[DEBUG] CACHE: removed /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
Maybe because the worker host had just been restarted beforehand?
Suggestion
There could be some retry on "521"
Updated by okurz over 4 years ago
- Related to action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) added
Updated by mkittler over 4 years ago
- Related to coordination #61922: [epic] Incomplete jobs with no logs at all added
Updated by okurz over 4 years ago
- Related to action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) added
Updated by kraih over 4 years ago
Having worked on the cache service for the past two months, I think this is either a sporadic routing issue in the network, or the webui Apache is overloaded at the time. The 521 is not a real HTTP status code; it is just a placeholder the cache service shows when the TCP connection could not be established at all. The timeout for establishing the connection is 10 seconds.
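To illustrate the placeholder idea, here is a minimal Python sketch, not the cache service's actual code (openQA is written in Perl). It assumes a hypothetical helper `connect_status` that reports 521 whenever no TCP connection could be established, so no HTTP response exists. Only the value 521 and the 10 second limit come from the comment above; all names are made up for illustration.

```python
import socket

# Values taken from the comment above; the names are hypothetical.
PLACEHOLDER_STATUS = 521
CONNECT_TIMEOUT_SECONDS = 10


def connect_status(host, port, connect=socket.create_connection):
    """Return (status, message); status is None if a TCP connection was made.

    `connect` is injectable so the behaviour can be checked without a network.
    """
    try:
        connection = connect((host, port), timeout=CONNECT_TIMEOUT_SECONDS)
    except OSError as error:
        # Covers timeouts, refused connections and unreachable hosts alike:
        # in every one of these cases there is no HTTP status to report.
        return PLACEHOLDER_STATUS, f"Connect failed: {error}"
    connection.close()
    return None, "connected"
```

The point of the sketch is that a 521 in the log says nothing about what the web UI answered, only that it could not be reached within the time limit.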
Updated by kraih over 4 years ago
- Assignee set to kraih
But agreed on the retry logic, it should probably cover this case too. I can take a look, maybe it's not that hard.
Updated by kraih over 4 years ago
This PR makes the cache service retry downloads for almost every error type. https://github.com/os-autoinst/openQA/pull/2666
Updated by okurz over 4 years ago
merged and deployed on o3. But we also retry on 404 now? https://openqa.opensuse.org/tests/1146615 shows
[2020-01-17T11:12:23.0562 CET] [info] Download of GNOME_Next.x86_64-3.34.3-Build13.98.iso processed:
[info] [#194488] Cache size of "/var/lib/openqa/cache" is 300GiB, with limit 300GiB
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (2 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (1 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Purging "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" because of too many download errors
two observations based on questions I have received from fvogt and guillaume_g on IRC:
- There should be no retry on 404
- The error 521 is confusing when on the low level we already know it's 404
Updated by kraih over 4 years ago
PR opened for 4xx handling. https://github.com/os-autoinst/openQA/pull/2675
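The behaviour requested above can be sketched in Python as follows. This is an illustration under assumptions, not the code from the PR: `fetch`, `should_retry` and `download_with_retries` are hypothetical names, 4xx responses are assumed not to be retried at all, and the attempt count and wait time are read off the log above (5 tries, 5 seconds apart). The retry message reports the status that was actually received rather than a fixed 521.

```python
import time

MAX_ATTEMPTS = 5        # the log shows "4 remaining" after the first failure
RETRY_WAIT_SECONDS = 5  # "waiting 5 seconds for next try"


def should_retry(status):
    """Client errors (4xx) are treated as permanent; anything else is retried."""
    return not 400 <= status < 500


def download_with_retries(fetch, sleep=time.sleep, log=print):
    """Call fetch() until it succeeds, returns a 4xx, or attempts run out.

    `fetch` is a hypothetical callable returning (status, message);
    a 2xx status counts as success.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, message = fetch()
        if 200 <= status < 300:
            return True
        log(f"Download failed: {status} - {message}")
        remaining = MAX_ATTEMPTS - attempt
        if not should_retry(status) or remaining == 0:
            return False
        # Report the real status here instead of a generic placeholder.
        log(f"Download error {status}, waiting {RETRY_WAIT_SECONDS} seconds "
            f"for next try ({remaining} remaining)")
        sleep(RETRY_WAIT_SECONDS)
    return False
```

With this shape a 404 fails after one request, while a 521 connect failure is tried up to five times before giving up.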
Updated by okurz over 4 years ago
- Has duplicate action #60167: "Download .* failed with: 521 - Connect timeout" jobs incomplete trying to download from cache, potentially worker specific added
Updated by okurz over 4 years ago
- Related to action #62459: Retry on download errors within GRU download tasks added
Updated by okurz over 4 years ago
- Status changed from In Progress to Resolved
- Target version set to Done
no fallout on o3. I guess we can call this "Resolved" then.
https://openqa.opensuse.org/tests/1150025 is the old state showing multiple retries for 404, https://openqa.opensuse.org/tests/1150570 now shows the expected behaviour of just a single retry for 404. I did not find retries for 5xx cases now, still I think we are done. Reopen if you think differently.