auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)
The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:
[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout
The error may seem rare at first glance, but that's most likely because of asset caching on the workers: a job that finds all its assets in the cache never has to download anything. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with connect timeout, 2 jobs downloaded at least one asset successfully and 6 jobs ran entirely from cache. So of the 4 jobs that actually needed a download, half failed. It's not clear from the logs whether the timeout happens during the initial connection or halfway through downloading a 2GB file.
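The tally above was done by hand. A minimal sketch of how it could be automated over a set of job logs follows, assuming the log text is already available as strings. The function names and the three category labels are made up for illustration. The sketch also assumes that a cache service "Downloading ... from ..." line marks a real download attempt and that a log without such a line ran entirely from cache, which has not been verified against openQA itself.

```python
import re
from collections import Counter

# Error string taken from the ticket subject.
TIMEOUT_RE = re.compile(r"download failed: 521 - Connect timeout")
# Assumption: a cache service line of this shape marks an actual download attempt.
DOWNLOAD_RE = re.compile(r'\[#\d+\] Downloading "[^"]+" from "[^"]+"')


def classify_log(log_text):
    """Sort one job log into one of three hypothetical categories."""
    if TIMEOUT_RE.search(log_text):
        return "connect_timeout"
    if DOWNLOAD_RE.search(log_text):
        return "downloaded"
    # Assumption: no download line at all means every asset came from the cache.
    return "cache_only"


def tally(logs):
    """Count how many logs fall into each category."""
    return Counter(classify_log(text) for text in logs)
```

Calling `tally()` on the logs of the last N jobs of a worker instance gives the same kind of breakdown as above, and dividing the `connect_timeout` count by the number of jobs that are not `cache_only` gives the failure rate among jobs that really hit the network.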
The oldest case confirmed by os-autoinst log is from 2019-12-15: https://openqa.suse.de/tests/3708066
There may have been older cases but their logs have most likely been deleted by now.
I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's low enough to be caused by chance.
- Subject changed from Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Current Sprint
- Subject changed from auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)
The SQL query
select id,reason,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and id in (select job_id from comments where text ~ 'poo#61844')) order by id desc;
doesn't yield any references for the past 10 days, so it seems the problem didn't happen again, at least not in the same way or with the same message.
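For readers less familiar with the query, its selection logic can be sketched in plain Python over in-memory rows. This is only an illustration of the three conditions (incomplete result, finished within the last 240 hours, labelled with the ticket reference in a comment) and not how openQA or auto-review implement it. The function name and the dict-based row layout are assumptions; only the column names come from the query.

```python
import re
from datetime import datetime, timedelta


def recent_labelled_incompletes(jobs, comments, now, ticket="poo#61844", hours=240):
    """Mimic the query: incomplete jobs finished within `hours` before `now`
    that have at least one comment mentioning `ticket`, newest id first."""
    cutoff = now - timedelta(hours=hours)  # 240 hours equals 10 days
    # Job ids that carry a comment matching the ticket reference.
    labelled = {
        c["job_id"] for c in comments if re.search(re.escape(ticket), c["text"])
    }
    hits = [
        j
        for j in jobs
        if j["result"] == "incomplete"
        and j["t_finished"] >= cutoff
        and j["id"] in labelled
    ]
    return sorted(hits, key=lambda j: j["id"], reverse=True)
```

An empty result, as reported here, means that no job met all three conditions at once. It does not rule out incompletes with a different message or without a label.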
The latest check for incompletes on https://gitlab.suse.de/openqa/auto-review/pipelines in https://gitlab.suse.de/openqa/auto-review/-/jobs/172723 also shows only other reasons for incompletes.
By now, incomplete openQA jobs should also provide a "reason" with the relevant information, directly visible in the info box and available over the API (and of course in the DB). This should make it easier to identify the issue the next time it occurs.
We are still running with the reduced number of worker instances.