action #55529

Job incompletes when it fails to reach the openQA webUI host just once, aka. retry on 521 connect timeout in cache

Added by okurz 7 months ago. Updated about 1 month ago.

Status: Resolved
Start date: 14/08/2019
Priority: Low
Due date:
Assignee: kraih
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

Observation

https://openqa.opensuse.org/tests/1007199/file/autoinst-log.txt shows

[2019-08-14T16:26:13.0435 CEST] [debug] Download of opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 processed: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 322103849472, Configured limit: 322122547200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 322103849472
[INFO] Downloading opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 from http://openqa1-opensuse/tests/1007199/asset/hdd/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
[DEBUG] CACHE: Download of /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 failed with: 521 - Connect timeout
[DEBUG] CACHE: removed /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2

Maybe because the worker host had restarted just before?

Suggestion

There could be a retry when the download fails with "521".


Related issues

Related to openQA Project - action #60866: Periodic stale job detection keeps scheduler busy produci... Resolved 10/12/2019
Related to openQA Project - action #61922: [epic] Incomplete jobs with no logs at all In Progress 03/02/2020
Related to openQA Infrastructure - action #61844: auto_review:"download failed: 521 - Connect timeout" Netw... Feedback 07/01/2020
Related to openQA Project - action #62459: Retry on download errors within GRU download tasks Resolved 21/01/2020
Duplicated by openQA Project - action #60167: "Download .* failed with: 521 - Connect timeout" jobs inc... Rejected 22/11/2019

History

#1 Updated by okurz 3 months ago

  • Related to action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) added

#2 Updated by mkittler about 1 month ago

  • Related to action #61922: [epic] Incomplete jobs with no logs at all added

#3 Updated by okurz about 1 month ago

  • Related to action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) added

#4 Updated by kraih about 1 month ago

Having worked on the cache service for the past two months, I think this is either a sporadic routing issue in the network or the webUI Apache being overloaded at the time. The 521 is not a real HTTP status code; it is just a placeholder the cache service shows when the TCP connection could not be established at all. The timeout for establishing the connection is 10 seconds.
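
To illustrate the placeholder idea, here is a minimal sketch in Python (not the cache service's actual code) of how a downloader might report a failed TCP connect as a synthetic 521. The function name `probe_connect`, the constant names, the message text and the injectable `connect` parameter are hypothetical; only the 521 placeholder and the 10 second connect timeout come from the comment above.

```python
import socket

CONNECT_TIMEOUT = 10      # seconds, as stated in the comment above
PLACEHOLDER_STATUS = 521  # not a real HTTP status, only a marker


def probe_connect(host, port, connect=socket.create_connection):
    """Return (status, message) for the TCP connect step only.

    `connect` is injectable (hypothetical design choice) so the sketch
    can be exercised without touching the network.
    """
    try:
        conn = connect((host, port), timeout=CONNECT_TIMEOUT)
    except OSError as err:
        # No HTTP response exists at this point, so there is no real
        # status code to report; a placeholder is used instead.
        return PLACEHOLDER_STATUS, f"Connect timeout ({err})"
    conn.close()
    return None, "connected"
```

Because the placeholder is produced before any HTTP exchange happens, it says nothing about the state of the web UI itself, only that the connection could not be opened in time.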

#5 Updated by kraih about 1 month ago

  • Assignee set to kraih

But agreed on the retry logic, it should probably cover this case too. I can take a look, maybe it's not that hard.

#6 Updated by kraih about 1 month ago

  • Status changed from New to Workable

#7 Updated by kraih about 1 month ago

This PR makes the cache service retry downloads for almost every error type. https://github.com/os-autoinst/openQA/pull/2666

#8 Updated by kraih about 1 month ago

  • Status changed from Workable to In Progress

#9 Updated by okurz about 1 month ago

Merged and deployed on o3. But do we now also retry on 404? https://openqa.opensuse.org/tests/1146615 shows:

[2020-01-17T11:12:23.0562 CET] [info] Download of GNOME_Next.x86_64-3.34.3-Build13.98.iso processed:
[info] [#194488] Cache size of "/var/lib/openqa/cache" is 300GiB, with limit 300GiB
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (2 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (1 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Purging "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" because of too many download errors

Two observations, based on questions I received from fvogt and guillaume_g on IRC:

  • There should be no retry on 404
  • Reporting error 521 in the retry message is confusing when the lower level already knows the status is 404
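
The two points above could be sketched as the following retry policy. This is a hedged illustration in Python, not the cache service's implementation: `is_retryable`, `download_with_retry` and the injected `fetch`, `log` and `sleep` callables are hypothetical names, and treating every status from 500 upwards (including the 521 placeholder) as retryable is an assumption. The five attempts and the 5 second wait are taken from the log excerpt.

```python
import time

MAX_ATTEMPTS = 5  # the log shows "4 remaining" after the first failure
RETRY_DELAY = 5   # seconds, as in the log excerpt


def is_retryable(status):
    """Assumed policy: retry server-side and connection errors only."""
    return status >= 500  # includes the 521 placeholder, excludes 404


def download_with_retry(fetch, log=print, sleep=time.sleep):
    """Call fetch() (returns an int status) until success or give-up."""
    status = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = fetch()
        if 200 <= status < 300:
            return status
        remaining = MAX_ATTEMPTS - attempt
        if not is_retryable(status) or remaining == 0:
            log(f"Download failed with {status}, giving up")
            return status
        # Report the status that was actually seen, not a fixed 521.
        log(f"Download error {status}, waiting {RETRY_DELAY} seconds "
            f"for next try ({remaining} remaining)")
        sleep(RETRY_DELAY)
    return status
```

With this policy a 404 ends the download after the first attempt, while a 521 is retried until the attempts are used up, and the log line always names the status that triggered the retry.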

#10 Updated by kraih about 1 month ago

#11 Updated by okurz about 1 month ago

  • Duplicated by action #60167: "Download .* failed with: 521 - Connect timeout" jobs incomplete trying to download from cache, potentially worker specific added

#12 Updated by okurz about 1 month ago

  • Related to action #62459: Retry on download errors within GRU download tasks added

#13 Updated by okurz about 1 month ago

  • Status changed from In Progress to Resolved
  • Target version set to Done

No fallout on o3. I guess we can call this "Resolved" then.

https://openqa.opensuse.org/tests/1150025 is the old state showing multiple retries for 404; https://openqa.opensuse.org/tests/1150570 now shows the expected behaviour of just a single retry for 404. I have not found any retries for 5xx cases so far, but I still think we are done. Reopen if you think differently.
