action #55529

closed

Job incompletes when it cannot reach the openQA webUI host a single time, aka retry on 521 connect timeout in cache

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2019-08-14
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://openqa.opensuse.org/tests/1007199/file/autoinst-log.txt shows

[2019-08-14T16:26:13.0435 CEST] [debug] Download of opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 processed: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 322103849472, Configured limit: 322122547200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 322103849472
[INFO] Downloading opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 from http://openqa1-opensuse/tests/1007199/asset/hdd/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2
[DEBUG] CACHE: Download of /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2 failed with: 521 - Connect timeout
[DEBUG] CACHE: removed /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.1-x86_64-Updates-20170213-1-gnome@64bit.qcow2

Maybe because the worker host had just restarted beforehand?

Suggestion

The cache could retry the download when it fails with "521".


Related issues 5 (0 open, 5 closed)

  • Related to openQA Project - action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) (Resolved, mkittler, 2019-12-10)
  • Related to openQA Project - coordination #61922: [epic] Incomplete jobs with no logs at all (Resolved, mkittler, 2020-02-03)
  • Related to openQA Infrastructure - action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) (Resolved, okurz, 2020-01-07)
  • Related to openQA Project - action #62459: Retry on download errors within GRU download tasks (Resolved, kraih, 2020-01-21)
  • Has duplicate openQA Project - action #60167: "Download .* failed with: 521 - Connect timeout" jobs incomplete trying to download from cache, potentially worker specific (Rejected, okurz, 2019-11-22)
Actions #1

Updated by okurz over 4 years ago

  • Related to action #60866: Periodic stale job detection keeps scheduler busy producing a massive number of incomplete jobs (was: Again many incomplete jobs with no logs at all) added
Actions #2

Updated by mkittler over 4 years ago

Actions #3

Updated by okurz over 4 years ago

  • Related to action #61844: auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others) added
Actions #4

Updated by kraih over 4 years ago

Having worked on the cache service for the past two months, I think this is either a sporadic routing issue in the network, or the webUI Apache is overloaded at the time. The 521 is not a real HTTP status code; it's just a placeholder the cache service shows when the TCP connection could not be established at all. The timeout for establishing the connection is 10 seconds.
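
To illustrate the placeholder idea, here is a minimal Python sketch (not the cache service's actual code, which is not shown in this ticket). The names `fetch_status`, `CONNECT_TIMEOUT` and `PLACEHOLDER_CONNECT_ERROR` are made up for the example, and the `opener` parameter only exists so the function can be tried without a network. Note that the `timeout` of `urllib.request.urlopen` covers blocking socket operations in general, not only connection setup, so it is just an approximation of a pure connect timeout.

```python
import socket
import urllib.error
import urllib.request

CONNECT_TIMEOUT = 10             # seconds, the value mentioned in the comment above
PLACEHOLDER_CONNECT_ERROR = 521  # hypothetical constant: not a real HTTP status


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status, reason); 521 stands in when there is no HTTP response."""
    try:
        with opener(url, timeout=CONNECT_TIMEOUT) as response:
            return response.status, response.reason
    except urllib.error.HTTPError as err:
        # The server answered, so a real status code such as 404 is available.
        return err.code, err.reason
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # No HTTP response at all: report the placeholder instead.
        return PLACEHOLDER_CONNECT_ERROR, "Connect timeout"
```

`HTTPError` has to be caught before `URLError` because it is a subclass of it; otherwise a real 404 would be reported as the placeholder.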

Actions #5

Updated by kraih over 4 years ago

  • Assignee set to kraih

But agreed on the retry logic; it should probably cover this case too. I can take a look, maybe it's not that hard.

Actions #6

Updated by kraih over 4 years ago

  • Status changed from New to Workable
Actions #7

Updated by kraih over 4 years ago

This PR makes the cache service retry downloads for almost every error type. https://github.com/os-autoinst/openQA/pull/2666

Actions #8

Updated by kraih over 4 years ago

  • Status changed from Workable to In Progress
Actions #9

Updated by okurz over 4 years ago

Merged and deployed on o3. But do we also retry on 404 now? https://openqa.opensuse.org/tests/1146615 shows

[2020-01-17T11:12:23.0562 CET] [info] Download of GNOME_Next.x86_64-3.34.3-Build13.98.iso processed:
[info] [#194488] Cache size of "/var/lib/openqa/cache" is 300GiB, with limit 300GiB
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (2 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Download error 521, waiting 5 seconds for next try (1 remaining)
[info] [#194488] Downloading "GNOME_Next.x86_64-3.34.3-Build13.98.iso" from "http://openqa1-opensuse/tests/1146615/asset/iso/GNOME_Next.x86_64-3.34.3-Build13.98.iso"
[info] [#194488] Download of "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" failed: 404 - Not Found
[info] [#194488] Purging "/var/lib/openqa/cache/openqa1-opensuse/GNOME_Next.x86_64-3.34.3-Build13.98.iso" because of too many download errors

Two observations based on questions I have received from fvogt and guillaume_g on IRC:

  • There should be no retry on 404
  • Reporting error 521 is confusing when the lower level already knows it's a 404
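
A rough Python sketch of what these two points could look like together, not the actual cache service code: `should_retry`, `download_with_retry` and the `fetch` callable (returning a `(status, reason)` pair) are hypothetical names. The 5 attempts and the 5 second wait are read off the log above, and "retry only on 5xx, including the 521 placeholder" is an assumption about the desired policy.

```python
import time

ATTEMPTS = 5      # the log above shows "4 remaining" after the first failure
RETRY_DELAY = 5   # seconds, as in "waiting 5 seconds for next try"


def should_retry(status):
    """Assumed policy: retry 5xx (including the 521 placeholder), never 4xx."""
    return status >= 500


def download_with_retry(fetch, url, attempts=ATTEMPTS, delay=RETRY_DELAY,
                        sleep=time.sleep, log=print):
    """Call fetch(url) until success, a non-retryable error, or no tries left."""
    for remaining in range(attempts - 1, -1, -1):
        status, reason = fetch(url)
        if 200 <= status < 300:
            return True
        log(f'Download of "{url}" failed: {status} - {reason}')
        if not should_retry(status):
            # E.g. 404: the asset is missing, trying again will not help.
            return False
        if remaining:
            # Log the status that was actually seen, not a fixed placeholder.
            log(f"Download error {status}, waiting {delay} seconds "
                f"for next try ({remaining} remaining)")
            sleep(delay)
    log(f'Giving up on "{url}" because of too many download errors')
    return False
```

With this policy a 404 ends after the first attempt, while a connect timeout is retried up to four more times.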
Actions #10

Updated by kraih over 4 years ago

Actions #11

Updated by okurz over 4 years ago

  • Has duplicate action #60167: "Download .* failed with: 521 - Connect timeout" jobs incomplete trying to download from cache, potentially worker specific added
Actions #12

Updated by okurz over 4 years ago

  • Related to action #62459: Retry on download errors within GRU download tasks added
Actions #13

Updated by okurz over 4 years ago

  • Status changed from In Progress to Resolved
  • Target version set to Done

No fallout on o3. I guess we can call this "Resolved" then.

https://openqa.opensuse.org/tests/1150025 is the old state showing multiple retries for 404; https://openqa.opensuse.org/tests/1150570 now shows the expected behaviour of just a single retry for 404. I have not found any retries for 5xx cases so far, but I still think we are done. Reopen if you think differently.
