action #69553
closed
coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback
Added by okurz over 4 years ago.
Updated about 4 years ago.
Category:
Regressions/Crashes
Description
Observation
https://openqa.opensuse.org/tests/1352135 shows reason "setup failure: Failed to rsync tests: exit code 10", autoinst-log.txt shows:
[2020-08-03T19:34:17.0982 UTC] [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #5593 sent to Cache Service
[2020-08-03T19:34:44.0427 UTC] [info] Output of rsync:
[info] [#5593] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
[2020-08-03T19:34:44.0655 UTC] [info] +++ worker notes +++
[2020-08-03T19:34:44.0655 UTC] [info] End time: 2020-08-03 19:34:44
[2020-08-03T19:34:44.0656 UTC] [info] Result: setup failure
Steps to reproduce
Unclear how this can be reproduced but as long as auto_review is finding related issues we can find these jobs with https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label:
openqa-query-for-job-label poo#69553
Suggestions
- Find out what "exit code 10" means: it means "Error in socket I/O" (see the rsync man page).
- Improve feedback to users, e.g. explain the possible reasons for the problem and what to do to fix it or work around it.
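One way the improved feedback could look, as a minimal sketch only: a small shell helper that turns an rsync exit code into a hint for the user. The function name `rsync_exit_hint` and the wording of the advice are assumptions made for illustration; the code descriptions themselves follow the EXIT VALUES section of the rsync man page. This is not the change that was eventually merged for this ticket.

```shell
# Hypothetical helper: map an rsync exit code to a short user-facing hint.
# Descriptions are taken from rsync(1), EXIT VALUES; the advice in
# parentheses is an assumption for illustration.
rsync_exit_hint() {
    case "$1" in
        0)  echo "Success" ;;
        10) echo "Error in socket I/O (possibly a temporary network problem; retriggering the job may help)" ;;
        23) echo "Partial transfer due to error (some files could not be transferred; check the rsync output)" ;;
        30) echo "Timeout in data send/receive (the rsync server may be overloaded or unreachable)" ;;
        *)  echo "rsync exit code $1 (see the EXIT VALUES section of 'man rsync')" ;;
    esac
}
```

For the job in the observation, `rsync_exit_hint 10` would print the socket I/O hint instead of leaving the user with the bare exit code.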
Workaround
Retriggering the job should work
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Low
I looked for recent occurrences:
$ ssh o3 "sudo -u geekotest psql openqa -c \"select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-01-01' and jobs.id = comments.job_id and comments.text ~ 'poo#69553' order by t_finished desc limit 10;\""
id | t_finished | state | result | test | reason | host
---------+---------------------+-------+------------+------+----------------------------------------------------+----------------
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqaworker7
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | oss-apollo7004
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | NONE
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
1352135 | 2020-08-03 19:34:49 | done | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
(10 rows)
So this did not happen again after 2020-08-03. However, I think we can still improve at least the user feedback.
- Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry to job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback
- Tags changed from caching, worker to caching, worker, cache, rsync, minion, ux, incomplete
- Description updated (diff)
- Related to action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry added
- Description updated (diff)
Yes, this does look very harmless, probably a temporary network issue. Retriggering the job should be fine.
It seems this can easily be solved together with #73396, and you have already seen the other, related pull request.
- Parent task set to #62420
- Status changed from Workable to In Progress
- Status changed from In Progress to Feedback
Merged. I suggest we wait until this is deployed to o3 and osd, then wait for some "auto-review" runs and check whether this ticket is still used for auto-labelling. If not, remove the auto_review: keyword from the ticket subject line and set the ticket to "Resolved". As a shortcut, after deployment we can instead look for the corresponding log message in the log files and then do the remaining steps.
- Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback to job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback
- Status changed from Feedback to Resolved
Since kraih is likely working on something else, I checked just now with
env host="o3 osd" failed_since="NOW() - interval '90 day'" sh -ex $(which openqa-query-for-job-label) 69553
and could only find a single job from 2020-09-13 on osd that is still listed, so this seems to work fine with the internal retrying.
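For reference, the general idea of retrying a flaky transfer internally can be sketched in plain shell. This is an illustration under assumptions only: the helper name `retry`, the attempt count and the delay are made up here, and this is not how the openQA cache service implements its retrying.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between attempts, and return the last exit code on failure.
# Usage: retry <attempts> <delay> <command> [args...]
# Note: POSIX sh has no local variables, so the names below are global.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while true; do
        rc=0
        "$@" || rc=$?
        [ "$rc" -eq 0 ] && return 0
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts (last exit code $rc)" >&2
            return "$rc"
        fi
        echo "attempt $n failed with exit code $rc, retrying in ${delay}s" >&2
        sleep "$delay"
        n=$((n + 1))
    done
}
```

With the rsync call from the log above this could be used as `retry 3 5 rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/`, so that a single "exit code 10" no longer makes the job incomplete.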