action #61844

auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

Added by MDoucha 6 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Start date: 2020-01-07
Due date:
% Done: 0%
Estimated time:
Duration:

Description

The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:

[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout

The error may seem rare at first glance, but that's most likely because of asset caching on the workers. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with a connect timeout, 2 jobs downloaded at least one asset successfully, and 6 jobs ran entirely from cache. It's not clear from the logs whether the timeout happens during the initial connection or halfway through downloading a 2 GB file.
https://openqa.suse.de/admin/workers/1298
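
A rough way to quantify this is to scan the autoinst-log.txt of recent jobs for the cache service messages quoted above. The following is only a heuristic sketch under assumptions taken from this ticket's excerpt: a failed download leaves the "download failed: 521 - Connect timeout" line, an actual fetch leaves a Downloading "..." from line, and a job served purely from cache leaves neither; adjust the markers to whatever your logs really contain.

# Heuristic classification of autoinst-log.txt files into "timeout",
# "downloaded" and "cache only", based on the cache service messages
# quoted above. The markers are assumptions from this ticket's log excerpt.
import sys
from pathlib import Path

counts = {"timeout": 0, "downloaded": 0, "cache only": 0}

for path in sys.argv[1:]:
    text = Path(path).read_text(errors="replace")
    if "download failed: 521 - Connect timeout" in text:
        counts["timeout"] += 1
    elif 'Downloading "' in text:  # cache service actually fetched an asset (assumed marker)
        counts["downloaded"] += 1
    else:
        counts["cache only"] += 1  # no fetch line at all, presumably served from cache

for label, count in counts.items():
    print(f"{label}: {count}")

Run it on the downloaded logs, e.g. python3 classify_logs.py autoinst-log-*.txt (the script name is just an example).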

The oldest case confirmed by os-autoinst log is from 2019-12-15: https://openqa.suse.de/tests/3708066
There may have been older cases but their logs have most likely been deleted by now.

I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's a low enough rate that it could simply be chance.


Related issues

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache (Resolved, 2019-08-14)

Blocked by openQA Infrastructure - action #64737: openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive (Resolved, 2020-03-24)

History

#1 Updated by okurz 6 months ago

  • Subject changed from Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

So I did two things so far:

#2 Updated by okurz 6 months ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added

#3 Updated by okurz 6 months ago

  • Subject changed from auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

This seems to be linked to #62237, as it also happens on other machines, e.g. https://openqa.suse.de/tests/3796147 on arm-1.

#4 Updated by okurz 4 months ago

The following SQL query doesn't yield any references for the past 10 days, so it seems the problem didn't happen again, at least not in the same way or with the same message:

select id,reason,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and id in (select job_id from comments where text ~ 'poo#61844')) order by id desc;
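
For repeated checks, the same query can be wrapped in a small script. This is only a sketch and assumes direct access to the openQA PostgreSQL database; the database name "openqa" and the connection details are assumptions that need adjusting to the actual setup.

# Run the same incomplete-jobs check from a script via psycopg2.
# dbname="openqa" and the implicit local connection are assumptions.
import psycopg2

QUERY = """
SELECT id, reason, test FROM jobs
WHERE result = 'incomplete'
  AND t_finished >= NOW() - interval '240 hour'
  AND id IN (SELECT job_id FROM comments WHERE text ~ 'poo#61844')
ORDER BY id DESC;
"""

with psycopg2.connect(dbname="openqa") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for job_id, reason, test in cur.fetchall():
            print(job_id, test, reason)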

The latest check for incompletes on https://gitlab.suse.de/openqa/auto-review/pipelines in https://gitlab.suse.de/openqa/auto-review/-/jobs/172723 also only shows other reasons for incompletes.

By now incomplete openQA jobs should also give a "reason" with the relevant information directly visible in the info box and available over the API (and of course the DB). With this we should have an easier time identifying the issue the next time it occurs.
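
Since the "reason" is exposed over the API, a suspect job can also be checked directly. A minimal sketch, assuming the standard GET /api/v1/jobs/<id> route and that the returned job JSON carries a "reason" field for incompletes (anonymous read access depends on the instance configuration):

# Fetch a job's "reason" over the openQA REST API.
# The "reason" field being present is an assumption for older jobs.
import requests

def job_reason(host, job_id):
    resp = requests.get(f"{host}/api/v1/jobs/{job_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()["job"].get("reason")

print(job_reason("https://openqa.suse.de", 3708066))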

We are still running with the reduced number of worker instances.

#5 Updated by okurz 3 months ago

  • Blocked by action #64737: openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive added

#6 Updated by okurz 3 months ago

  • Status changed from Feedback to Blocked

I would check again and also increase the number of worker instances again, but openqaworker-arm-3 is completely down, including the management interface; blocked by #64737

#7 Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

openqaworker-arm-3 is back up, and https://github.com/os-autoinst/openQA/pull/2895 should help with retriable errors.
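
For illustration only, the idea behind treating such errors as retriable is a plain retry-with-backoff around the download; this is not the code from that PR (openQA's cache service is written in Perl), just a sketch of the pattern:

# Generic retry-with-backoff sketch for downloads that fail with transient
# errors such as the 521 connect timeout. Illustrative only, not the PR code.
import time

def download_with_retries(download, attempts=5, initial_delay=5):
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except ConnectionError as err:  # stand-in for whatever counts as "retriable"
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({err}), retrying in {delay}s")
            time.sleep(delay)
            delay *= 2  # exponential backoff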
