action #69553
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback
Description
Observation
https://openqa.opensuse.org/tests/1352135 shows reason "setup failure: Failed to rsync tests: exit code 10", autoinst-log.txt shows:
[2020-08-03T19:34:17.0982 UTC] [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #5593 sent to Cache Service
[2020-08-03T19:34:44.0427 UTC] [info] Output of rsync:
[info] [#5593] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
[2020-08-03T19:34:44.0655 UTC] [info] +++ worker notes +++
[2020-08-03T19:34:44.0655 UTC] [info] End time: 2020-08-03 19:34:44
[2020-08-03T19:34:44.0656 UTC] [info] Result: setup failure
Steps to reproduce
It is unclear how this can be reproduced, but as long as auto_review keeps finding related issues we can find these jobs with https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label:
openqa-query-for-job-label poo#69553
Suggestions
- Find out what "exit code 10" means: it means "Error in socket I/O" (see the rsync man page).
- Improve feedback to users, e.g. what are possible reasons for the problem and what to do to fix or work around it.
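A minimal sketch of what more helpful feedback could look like, assuming a hypothetical helper `rsync_exit_hint` that is not part of openQA. The meanings of codes 10 and 23 are taken from the EXIT VALUES section of the rsync man page; the advice appended to them is only an illustrative suggestion.

```shell
#!/bin/sh
# Hypothetical helper (not an openQA function): map an rsync exit code to
# its man-page meaning plus a short hint for the user.
rsync_exit_hint() {
    case "$1" in
        0)  echo "Success" ;;
        10) echo "Error in socket I/O: likely a temporary network problem, retrigger the job" ;;
        23) echo "Partial transfer due to error: some files could not be transferred, check the rsync output" ;;
        *)  echo "Unknown rsync exit code $1: see the EXIT VALUES section of 'man rsync'" ;;
    esac
}

# Example: the exit code reported in this ticket
rsync_exit_hint 10
```

Such a hint could be appended to the "Failed to rsync tests: exit code 10" reason so that users see a likely cause and a next step without consulting the man page.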
Workaround
Retriggering the job should work.
Updated by okurz about 4 years ago
- Related to action #46658: openqaworker4 caching fails added
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Low
I looked for recent occurrences:
$ ssh o3 "sudo -u geekotest psql openqa -c \"select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-01-01' and jobs.id = comments.job_id and comments.text ~ 'poo#69553' order by t_finished desc limit 10;\""
id | t_finished | state | result | test | reason | host
---------+---------------------+-------+------------+------+----------------------------------------------------+----------------
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqaworker7
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | oss-apollo7004
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | NONE
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
(10 rows)
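So this did not happen again after 2020-08-03. Note that all ten rows show the same job 1352135: the query lists `workers` without a join condition, so the single matching job is repeated once per worker row and the `host` column is not meaningful here. However, I think we can still improve at least the user feedback.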
Updated by okurz about 4 years ago
- Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry to job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback
Updated by okurz about 4 years ago
- Tags changed from caching, worker to caching, worker, cache, rsync, minion, ux, incomplete
Updated by okurz about 4 years ago
- Related to action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry added
Updated by kraih about 4 years ago
Yes, this does look very harmless, probably a temporary network issue. Retriggering the job should be fine.
Updated by okurz about 4 years ago
Seems like this can easily be solved together with #73396, and you have already seen the other, related pull request.
Updated by kraih almost 4 years ago
- Status changed from Workable to In Progress
Updated by kraih almost 4 years ago
Updated by okurz almost 4 years ago
- Status changed from In Progress to Feedback
Merged. I suggest we wait until this is deployed to o3 and osd, then wait for some "auto-review" runs, check whether this ticket is still used for auto-labelling and, if not, remove the auto_review: keyword from the ticket subject line and set the ticket to "Resolved". Alternatively, as a shortcut, we can look for a corresponding log message in the logfiles after deployment and then do the other steps.
Updated by okurz almost 4 years ago
- Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback to job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback
- Status changed from Feedback to Resolved
Since kraih is likely working on something different, I checked just now with
env host="o3 osd" failed_since="NOW() - interval '90 day'" sh -ex $(which openqa-query-for-job-label) 69553
and could only find a single job from 2020-09-13 on osd that is still listed, so this seems to work fine with the internal retrying.
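For illustration only, the idea of retrying on transient rsync failures can be sketched in shell. This is not the actual implementation merged for this ticket; the helper name `retry_on_transient`, the `RETRY_DELAY` variable and the choice to treat exit codes 10 and 23 as transient are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: run a command and retry it only when it exits with
# a code treated as transient here (10 and 23, an assumption).
# Usage: retry_on_transient MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_transient() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        "$@"
        rc=$?
        case "$rc" in
            10|23) ;;            # transient: fall through to the retry logic
            *) return "$rc" ;;   # success or any other failure: stop here
        esac
        # Give up and report the last exit code once all attempts are used
        [ "$attempt" -ge "$max_attempts" ] && return "$rc"
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"   # seconds between attempts (assumed default)
    done
}
```

With this helper, the rsync call from the log above could be wrapped as `retry_on_transient 3 rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/`, so that a single socket I/O error no longer makes the job incomplete.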