action #103771

Retry on rsync errors like "exit code 5" instead of failing the job (which then retriggers)

Added by okurz 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-12-09
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/7816422 ended as incomplete with "Reason: cache failure: Failed to rsync tests: exit code 5" while I was conducting the OSD Leap 15.2->15.3 upgrade (#99198).

https://openqa.suse.de/tests/7816422/logfile?filename=autoinst-log.txt says

[2021-12-09T14:08:29.710481+01:00] [info] [pid:23787] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #11572 sent to Cache Service
[2021-12-09T14:08:34.962673+01:00] [info] [pid:23787] Output of rsync:
[info] [#11572] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/

[2021-12-09T14:08:34.962875+01:00] [error] [pid:23787] Failed to rsync tests: exit code 5

https://openqa.suse.de/tests/7816422/logfile?filename=worker-log.txt says

SP2-Installer-DVD-x86_64-GM-DVD1.iso" to "/var/lib/openqa/pool/7/SLE-15-SP2-Installer-DVD-x86_64-GM-DVD1.iso"
[2021-12-09T14:08:29.710796+01:00] [debug] [pid:23787] Updating status so job 7816422 is not considered dead.
[2021-12-09T14:08:29.711357+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:34.811331+01:00] [debug] [pid:23787] Updating status so job 7816422 is not considered dead.
[2021-12-09T14:08:34.812154+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:34.962989+01:00] [error] [pid:23787] Unable to setup job 7816422: Failed to rsync tests: exit code 5
[2021-12-09T14:08:34.963141+01:00] [debug] [pid:23787] Stopping job 7816422 from openqa.suse.de: 07816422-sle-15-SP2-Server-DVD-Incidents-x86_64-Build:22102:release-notes-sles-mau-filesystem@64bit - reason: setup failure
[2021-12-09T14:08:34.963556+01:00] [debug] [pid:23787] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7816422/status
[2021-12-09T14:08:35.027993+01:00] [info] [pid:23927] Uploading autoinst-log.txt

Acceptance criteria

  • AC1: The cache download is retried for a reasonable time to cover unavailability of the cache target in similar cases

Suggestions

The rsync man page explains that exit code 5 means "Error starting client-server protocol". We should ~~either instruct rsync to retry on that (seems not to be a feature of rsync) or~~ put some retry around the rsync call. Maybe use https://metacpan.org/dist/App-rsync-retry/view/script/rsync-retry, which is currently only available in devel:languages:perl:CPAN-A.
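To illustrate the "put some retry around the rsync call" idea, here is a minimal sketch in Python. It is not the implementation from the openQA pull request and not how rsync-retry works internally. The function name `rsync_with_retry`, the `RETRYABLE_EXIT_CODES` set, and the default attempt count and delay are all assumptions made for this example. The only fact taken from the ticket is that exit code 5 means "Error starting client-server protocol".

```python
import subprocess
import time

# Exit codes treated as transient. 5 ("Error starting client-server protocol")
# comes from the rsync man page; limiting the set to 5 is an assumption.
RETRYABLE_EXIT_CODES = {5}


def rsync_with_retry(cmd, attempts=5, delay=10.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd and retry while it exits with a retryable code.

    Returns the exit code of the last attempt. `run` and `sleep` are
    injectable so the retry logic can be exercised without a real rsync.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = None
    for attempt in range(1, attempts + 1):
        code = run(cmd).returncode
        if code not in RETRYABLE_EXIT_CODES:
            # Success or a non-transient failure: stop retrying.
            return code
        if attempt < attempts:
            # Wait before the next attempt; no wait after the final one.
            sleep(delay)
    return code
```

With the command from the log above, a call could look like `rsync_with_retry(["rsync", "-avHP", "--timeout", "1800", "rsync://openqa.suse.de/tests/", "--delete", "/var/lib/openqa/cache/openqa.suse.de/tests/"])`. A fixed delay keeps the sketch short; whether a fixed or growing delay fits AC1's "reasonable time" better is left open here.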


Related issues

Related to openQA Infrastructure - action #99198: Upgrade osd webUI host to openSUSE Leap 15.3 size:M (Resolved)

History

#1 Updated by okurz 5 months ago

Not sure if we will use it, but I already created an SR to add the mentioned Perl helper package to devel:languages:perl: https://build.opensuse.org/request/show/937791

#2 Updated by okurz 5 months ago

  • Related to action #99198: Upgrade osd webUI host to openSUSE Leap 15.3 size:M added

#3 Updated by osukup 4 months ago

  • Status changed from New to In Progress
  • Assignee set to osukup

#4 Updated by cdywan 4 months ago

osukup wrote:

https://github.com/os-autoinst/openQA/pull/4464

Review ongoing. The new tests appear to be flaky and warrant more investigation.

#5 Updated by osukup 3 months ago

  • Status changed from In Progress to Resolved

PR merged. The tests are in the unstable group (so probably not related to the change).

Big thanks to tinita for work on tests :D

#6 Updated by okurz 3 months ago

Well, we don't actually know if this works in production but this time I will trust you :)
