action #61844
closed
auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)
0%
Description
The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:
[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout
The error may seem rare at first glance, but that's most likely because of asset caching on workers. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with connect timeout, 2 jobs downloaded at least one asset successfully and 6 jobs ran entirely from cache. It's not clear from the logs whether the timeout happens during the initial connection or halfway through downloading a 2 GB file.
https://openqa.suse.de/admin/workers/1298
The oldest case confirmed by os-autoinst log is from 2019-12-15: https://openqa.suse.de/tests/3708066
There may have been older cases but their logs have most likely been deleted by now.
I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's low enough to be caused by chance.
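The per-worker tally in the description (timeout / downloaded / cache only) could be reproduced from job logs with something like the following. This is only a rough sketch: the two patterns are copied from the log excerpt above, while `classify_job_log` and `tally` are hypothetical helper names, and treating "no download line at all" as "ran entirely from cache" is an assumption.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt in the description.
FAILED = re.compile(r"because the download failed: 521 - Connect timeout")
DOWNLOADING = re.compile(r'\[#\d+\] Downloading "[^"]+" from')


def classify_job_log(log_text):
    """Classify one job log (hypothetical helper, not part of openQA)."""
    # Check the failure first: a failed job also contains a "Downloading" line.
    if FAILED.search(log_text):
        return "connect_timeout"
    if DOWNLOADING.search(log_text):
        return "downloaded"
    # Assumption: no download attempt logged means every asset came from cache.
    return "cache_only"


def tally(logs):
    """Count the outcomes over an iterable of job log texts."""
    return Counter(classify_job_log(text) for text in logs)
```

Feeding it the log texts of the last N jobs of a worker instance gives a `Counter` that can be compared between arm-1 and arm-3.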
Updated by okurz over 5 years ago
- Subject changed from Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Current Sprint
So far I have done two things:
- Change ticket to be picked up by https://gitlab.suse.de/openqa/auto-review/
- Reduce the number of worker instances on arm-3 to 4 for #41882, but also to see whether this has an impact on stability
Updated by okurz over 5 years ago
- Related to action #55529: job incompletes when it can not reach the openqa webui host just for a single time aka. retry on 521 connect timeout in cache added
Updated by okurz over 5 years ago
- Subject changed from auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 to auto_review:"download failed: 521 - Connect timeout" Network issues on openqaworker-arm-3 (and others)
This seems to be linked to #62237, and it also happens on other machines, e.g. https://openqa.suse.de/tests/3796147 on arm-1.
Updated by okurz about 5 years ago
The SQL query select id,reason,test from jobs where (result='incomplete' and t_finished >= (NOW() - interval '240 hour') and id in (select job_id from comments where text ~ 'poo#61844')) order by id desc;
doesn't yield any references for the past 10 days, so it seems the problem didn't happen again, at least not in the same way or with the same message.
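For anyone who wants to run the same check on exported data instead of the database, the filter logic of the query above can be mirrored roughly like this. It is a sketch only: the column names come from the query, but `recent_incompletes` and the list-of-dicts input format are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta


def recent_incompletes(jobs, comments, ticket="poo#61844", hours=240, now=None):
    """Mirror the SQL query on in-memory rows (hypothetical helper).

    jobs: dicts with id, result, t_finished (datetime), reason, test
    comments: dicts with job_id, text
    """
    now = now or datetime.now()
    cutoff = now - timedelta(hours=hours)
    # Subselect: job ids that have a comment matching the ticket reference.
    pattern = re.compile(re.escape(ticket))
    tagged = {c["job_id"] for c in comments if pattern.search(c["text"])}
    hits = [
        (j["id"], j.get("reason"), j["test"])
        for j in jobs
        if j["result"] == "incomplete"
        and j["t_finished"] >= cutoff
        and j["id"] in tagged
    ]
    # ORDER BY id DESC
    return sorted(hits, key=lambda row: row[0], reverse=True)
```

An empty result corresponds to the "no references for the past 10 days" outcome reported here.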
The latest check for incompletes on https://gitlab.suse.de/openqa/auto-review/pipelines in https://gitlab.suse.de/openqa/auto-review/-/jobs/172723 also only points to other reasons for incompletes.
By now incomplete openQA jobs should also give a "reason" with the relevant information directly available in the info box and available over the API (and of course the DB). With this it should be easier to identify the issue the next time it occurs.
We are still running with the reduced number of worker instances.
Updated by okurz about 5 years ago
- Blocked by action #64737: openqaworker-arm-3 is down since 2020-03-16, also IPMI unresponsive added
Updated by okurz about 5 years ago
- Status changed from Feedback to Blocked
I would check again and also increase the number of worker instances again, but openqaworker-arm-3 is completely down, including the management interface. Blocked by #64737.
Updated by okurz about 5 years ago
- Status changed from Blocked to Resolved
openqaworker-arm-3 is back up. https://github.com/os-autoinst/openQA/pull/2895 should help with retriable errors.
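To illustrate what "retrying on retriable errors" means in general, here is a minimal sketch of a download loop with exponential backoff. It is not the implementation from the pull request; `download_with_retry`, the `fetch` callable and the default values are hypothetical, and treating only status 521 (as seen in the log above) as retriable is an assumption.

```python
import time

# Assumption: only the status seen in this ticket is treated as retriable.
RETRIABLE_STATUS = {521}


def download_with_retry(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    fetch is a hypothetical callable returning (status, body).
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        last_status = status
        # Give up on non-retriable errors or when no attempts are left.
        if status not in RETRIABLE_STATUS or attempt == attempts:
            break
        # Exponential backoff: delay, 2*delay, 4*delay, ...
        sleep(delay * 2 ** (attempt - 1))
    raise RuntimeError(f"download failed: {last_status}")
```

With such a loop a single connect timeout no longer turns the whole job into an incomplete, which is the behaviour asked for in #55529.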