action #61844

auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

Added by MDoucha about 1 month ago. Updated 2 days ago.

Status: Feedback
Start date: 07/01/2020
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:

Description

The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:

[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout
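
To count occurrences across many logs, the excerpt above can be scanned for the auto_review pattern. The following is only a sketch: the helper name `find_connect_timeouts` and the `PURGE_LINE` regex are illustrative assumptions based on the log lines quoted here, not part of openQA.

```python
import re

# Pattern taken verbatim from the ticket subject (auto_review:"...").
AUTO_REVIEW = re.compile(r"download failed: 521 - Connect timeout")

# Hypothetical helper regex, modelled on the "Purging" line quoted above:
# captures the cache request number and the path of the purged asset.
PURGE_LINE = re.compile(r'\[#(?P<request>\d+)\] Purging "(?P<path>[^"]+)" because the ')


def find_connect_timeouts(log_text):
    """Return (request_number, purged_path) for every 521 connect timeout."""
    hits = []
    for line in log_text.splitlines():
        if not AUTO_REVIEW.search(line):
            continue
        match = PURGE_LINE.search(line)
        if match:
            hits.append((int(match.group("request")), match.group("path")))
    return hits


# Sample input copied from the log excerpt in this ticket.
SAMPLE = (
    '[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB\n'
    '[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/'
    'SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: '
    '521 - Connect timeout\n'
)

for request, path in find_connect_timeouts(SAMPLE):
    print(request, path)
```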

The error may seem rare at first glance but that's most likely because of asset caching on workers. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with connect timeout, 2 jobs downloaded at least one asset successfully and 6 jobs ran entirely from cache. It's not clear from logs whether the timeout happens during the initial connection or halfway through downloading a 2GB file.
https://openqa.suse.de/admin/workers/1298
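
The effect of caching on the apparent failure rate can be made explicit with the numbers above. This is a back-of-the-envelope sketch at job level; the variable names are illustrative, and it assumes the three groups of jobs do not overlap.

```python
# Counts from the last 10 jobs on openqaworker-arm-3:19, as reported above.
failed_with_timeout = 2
downloaded_ok = 2
ran_from_cache = 6

total_jobs = failed_with_timeout + downloaded_ok + ran_from_cache
# Jobs that ran entirely from cache never touched the network.
jobs_needing_download = failed_with_timeout + downloaded_ok

overall_rate = failed_with_timeout / total_jobs
network_rate = failed_with_timeout / jobs_needing_download

print(f"{overall_rate:.0%} of all jobs, {network_rate:.0%} of jobs that needed a download")
# prints "20% of all jobs, 50% of jobs that needed a download"
```

With such a small sample this is only indicative, but it shows why the failure looks rarer than it is.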

The oldest case confirmed by os-autoinst log is from 2019-12-15: https://openqa.suse.de/tests/3708066
There may have been older cases but their logs have most likely been deleted by now.

I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's low enough to be caused by chance.


Related issues

Related to openQA Project - action #55529: job incompletes when it can not reach the openqa webui ho... Resolved 14/08/2019

History

#1 Updated by okurz about 1 month ago

  • Subject changed from Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

So I did two things so far:

#2 Updated by okurz about 1 month ago

  • Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added

#3 Updated by okurz about 1 month ago

  • Subject changed from auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)

This seems to be linked to #62237 and also occurs on other machines, e.g. https://openqa.suse.de/tests/3796147 on arm-1.

#4 Updated by okurz 2 days ago

The SQL query select id,reason,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and id in (select job_id from comments where text ~ 'poo#61844')) order by id desc; yields no references for the past 10 days, so it seems the problem didn't happen again, at least not in the same way or with the same message.
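
For repeated checks, the query can be wrapped in a small helper. This is a sketch only: `recent_incompletes` is a hypothetical name, and the cursor is assumed to be a Python DB-API 2.0 cursor already connected to the openQA PostgreSQL database (connection setup is not shown and depends on the deployment).

```python
# The query from this comment, reformatted over several lines but otherwise unchanged.
INCOMPLETES_QUERY = """
select id, reason, test
from jobs
where (result = 'incomplete'
       and t_finished >= (NOW() - interval '240 hour')
       and id in (select job_id from comments where text ~ 'poo#61844'))
order by id desc;
"""


def recent_incompletes(cursor):
    """Run the query on an assumed DB-API 2.0 cursor and return all rows."""
    cursor.execute(INCOMPLETES_QUERY)
    return cursor.fetchall()
```

An empty result means no incomplete job from the last 240 hours carries a comment referencing this ticket.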

The latest check for incompletes on https://gitlab.suse.de/openqa/auto-review/pipelines in https://gitlab.suse.de/openqa/auto-review/-/jobs/172723 also shows only other reasons for incompletes.

By now, incomplete openQA jobs should also give a "reason", with the relevant information directly available in the info box and over the API (and of course in the DB). This should make it easier to identify the issue next time.

We are still running with the reduced number of worker instances.
