action #61844
auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

Added by MDoucha over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: -
Start date: 2020-01-07
% Done: 0%

Description

The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:

[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout

The error may seem rare at first glance, but that is most likely because of asset caching on the workers. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with a connect timeout, 2 jobs downloaded at least one asset successfully, and 6 jobs ran entirely from cache; in other words, half of the jobs that actually hit the network failed. It's not clear from the logs whether the timeout happens during the initial connection or halfway through downloading a 2 GB file.
https://openqa.suse.de/admin/workers/1298

The oldest case confirmed by an os-autoinst log is from 2019-12-15: https://openqa.suse.de/tests/3708066
There may have been older cases, but their logs have most likely been deleted by now.

I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's low enough to be caused by chance.
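
Since asset caching hides most download attempts, one way to quantify the failure rate is to scan downloaded job logs for the error pattern directly. A minimal sketch in Python, assuming the autoinst-log.txt files have already been fetched into a local directory; the path and layout are hypothetical:

import re
from pathlib import Path

# Error pattern taken from the cache service log above.
PATTERN = re.compile(r'download failed: 521 - Connect timeout')

# Hypothetical layout: one directory per job, each with autoinst-log.txt.
log_dir = Path('logs')

hits, total = 0, 0
for log_file in sorted(log_dir.glob('*/autoinst-log.txt')):
    total += 1
    if PATTERN.search(log_file.read_text(errors='replace')):
        hits += 1
        print(f'521 connect timeout in {log_file}')

print(f'{hits} of {total} job logs contain the error')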


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache (Resolved, kraih, 2019-08-14)

Blocked by openQA Infrastructure - action #64737: openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive (Resolved, okurz, 2020-03-24)

Actions #1

Updated by okurz over 4 years ago

  • Subject changed from Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

So I did two things so far: changed the subject to the auto_review pattern above, and set the ticket to Feedback.
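
For context on the subject change: the auto_review:"…" prefix is meant to let review tooling match the quoted pattern against the logs of incomplete jobs and label them with this ticket (the SQL query in a later comment then finds jobs carrying that label). A minimal sketch of that kind of matching in Python; the function and sample line are hypothetical, not the actual auto-review code:

import re

TICKET = 'poo#61844'
# Pattern from the ticket subject.
PATTERN = re.compile(r'download failed: 521 - Connect timeout')

def label_for(log_text: str) -> str | None:
    # Return the ticket label if the failure matches this issue.
    return TICKET if PATTERN.search(log_text) else None

line = '[info] Purging cache entry because the download failed: 521 - Connect timeout'
print(label_for(line))  # poo#61844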

Actions #2

Updated by okurz over 4 years ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added
Actions #3

Updated by okurz over 4 years ago

  • Subject changed from auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

This seems to be linked to #62237, also on other machines, e.g. https://openqa.suse.de/tests/3796147 on arm-1.

Actions #4

Updated by okurz about 4 years ago

The following SQL query doesn't yield any references for the past 10 days, so it seems the problem didn't happen again, at least not in the same way or with the same message:

select id, reason, test from jobs
  where result = 'incomplete'
    and t_finished >= (NOW() - interval '240 hour')
    and id in (select job_id from comments where text ~ 'poo#61844')
  order by id desc;

The latest check for incompletes on https://gitlab.suse.de/openqa/auto-review/pipelines in https://gitlab.suse.de/openqa/auto-review/-/jobs/172723 also only shows other reasons for incompletes.

By now incomplete openQA jobs should also carry a "reason" with the relevant information directly visible in the info box and available over the API (and of course the DB). This should make it easier to identify the issue the next time it shows up.
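
For reference, a minimal sketch of reading that reason over the API with Python; the job id is just an example from this ticket, and the exact response layout is my assumption about the usual openQA job route:

import requests

job_id = 3796147  # example job id mentioned above
resp = requests.get(f'https://openqa.suse.de/api/v1/jobs/{job_id}', timeout=30)
resp.raise_for_status()
job = resp.json()['job']
print(job.get('result'), job.get('reason'))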

We are still running with the reduced number of worker instances.

Actions #5

Updated by okurz about 4 years ago

  • Blocked by action #64737: openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive added
Actions #6

Updated by okurz about 4 years ago

  • Status changed from Feedback to Blocked

I would check again and also increase the number of worker instances again, but openqaworker-arm-3 is completely down, including the management interface; blocked by #64737.

Actions #7

Updated by okurz about 4 years ago

  • Status changed from Blocked to Resolved

openqaworker-arm-3 is back up, and https://github.com/os-autoinst/openQA/pull/2895 should help with retriable errors.
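
The idea of retrying retriable errors can be sketched as a small download wrapper. This is only an illustration in Python under my own assumptions, not the actual change from that PR (the cache service itself is written in Perl), and all names here are made up:

import time
import requests

# Status codes treated as retriable; 521 is the connect timeout from the logs above.
RETRIABLE = {502, 503, 504, 521}

def download(url: str, dest: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, stream=True, timeout=60)
        except requests.ConnectionError:
            resp = None  # network-level failure, also worth retrying
        if resp is not None and resp.status_code == 200:
            with open(dest, 'wb') as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
            return
        if resp is not None and resp.status_code not in RETRIABLE:
            raise RuntimeError(f'download failed: {resp.status_code}')
        time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f'download failed after {attempts} attempts')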
