action #103771
closed
Retry on rsync errors like "exit code 5" instead of failing the job (which then retriggers)
Description
Observation
https://openqa.suse.de/tests/7816422 ended up incomplete with "Reason: cache failure: Failed to rsync tests: exit code 5" while I was conducting the OSD Leap 15.2->15.3 upgrade (#99198).
https://openqa.suse.de/tests/7816422/logfile?filename=autoinst-log.txt says
[2021-12-09T14:08:29.710481+01:00] [info] [pid:23787] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #11572 sent to Cache Service
[2021-12-09T14:08:34.962673+01:00] [info] [pid:23787] Output of rsync:
[info] [#11572] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
[2021-12-09T14:08:34.962875+01:00] [error] [pid:23787] Failed to rsync tests: exit code 5
https://openqa.suse.de/tests/7816422/logfile?filename=worker-log.txt says
SP2-Installer-DVD-x86_64-GM-DVD1.iso" to "/var/lib/openqa/pool/7/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso"
[2021-12-09T14:08:29.710796+01:00] [debug] [pid:23787] Updating status so job 7816422 is not considered dead.
[2021-12-09T14:08:29.711357+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:34.811331+01:00] [debug] [pid:23787] Updating status so job 7816422 is not considered dead.
[2021-12-09T14:08:34.812154+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:34.962989+01:00] [error] [pid:23787] Unable to setup job 7816422: Failed to rsync tests: exit code 5
[2021-12-09T14:08:34.963141+01:00] [debug] [pid:23787] Stopping job 7816422 from openqa.suse.de: 07816422-sle-15-SP2-Server-DVD-Incidents-x86_64-Build:22102:release-notes-sles-mau-filesystem@64bit - reason: setup failure
[2021-12-09T14:08:34.963556+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:35.027993+01:00] [info] [pid:23927] Uploading autoinst-log.txt
Acceptance criteria
- AC1: The cache download is retried for a reasonable time to cover temporary unavailability of the cache target in similar cases
Suggestions
The rsync man page explains that exit code 5 means "Error starting client-server protocol". We should ~~either instruct rsync to retry on that (which does not seem to be a feature of rsync) or~~ put some retry around the rsync call. Maybe use https://metacpan.org/dist/App-rsync-retry/view/script/rsync-retry, which is currently only available in devel:languages:perl:CPAN-A.
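To illustrate what "some retry around the rsync call" could look like, here is a minimal sketch in Python. It is not the code from the merged PR and not how the openQA cache service is implemented. The function name `rsync_with_retry`, the `attempts` and `delay` defaults and the injectable `run`/`sleep` parameters are all assumptions made for this example. Only exit code 5 is treated as retryable; any other exit code is returned immediately.

```python
import subprocess
import time

# Per the rsync man page, exit code 5 means
# "Error starting client-server protocol".
RETRYABLE_EXIT_CODES = {5}


def rsync_with_retry(cmd, attempts=5, delay=10.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd and retry while it exits with a retryable code.

    Hypothetical helper: attempts and delay are illustrative defaults,
    not values taken from openQA. Returns the exit code of the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = None
    for attempt in range(1, attempts + 1):
        code = run(cmd).returncode
        if code not in RETRYABLE_EXIT_CODES:
            # Success or a non-retryable error: report it right away.
            return code
        if attempt < attempts:
            # Give the rsync server time to come back before retrying.
            sleep(delay)
    return code
```

With the command from the log above this could be called as `rsync_with_retry(["rsync", "-avHP", "--timeout", "1800", "rsync://openqa.suse.de/tests/", "--delete", "/var/lib/openqa/cache/openqa.suse.de/tests/"])`. The job would then only fail with a setup failure if the final attempt still returns a non-zero exit code.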
Updated by okurz almost 3 years ago
Not sure if we will use it, but I already created an SR to add the mentioned perl helper package to devel:languages:perl: https://build.opensuse.org/request/show/937791
Updated by okurz almost 3 years ago
- Related to action #99198: Upgrade osd webUI host to openSUSE Leap 15.3 size:M added
Updated by osukup almost 3 years ago
- Status changed from New to In Progress
- Assignee set to osukup
Updated by livdywan almost 3 years ago
osukup wrote:
Review ongoing. The new tests appear to be flaky and warrant more investigation.
Updated by osukup almost 3 years ago
- Status changed from In Progress to Resolved
PR merged. Tests are in the unstable group (so probably not related to the change).
Big thanks to @tinita for work on tests :D
Updated by okurz almost 3 years ago
Well, we don't actually know if this works in production but this time I will trust you :)