action #69553

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback

Added by okurz 6 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-08-04
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/1352135 shows reason "setup failure: Failed to rsync tests: exit code 10", autoinst-log.txt shows:

[2020-08-03T19:34:17.0982 UTC] [info] Rsync from 'rsync://openqa1-opensuse/tests' to '/var/lib/openqa/cache/openqa1-opensuse', request #5593 sent to Cache Service
[2020-08-03T19:34:44.0427 UTC] [info] Output of rsync:
[info] [#5593] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/

[2020-08-03T19:34:44.0655 UTC] [info] +++ worker notes +++
[2020-08-03T19:34:44.0655 UTC] [info] End time: 2020-08-03 19:34:44
[2020-08-03T19:34:44.0656 UTC] [info] Result: setup failure

Steps to reproduce

It is unclear how this can be reproduced, but as long as auto_review is finding related issues we can find these jobs with https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label:

openqa-query-for-job-label poo#69553

Suggestions

  • "exit code 10" means "Error in socket I/O" (see the rsync man page). Improve feedback to users, e.g. list possible reasons for the problem and what to do to fix or work around it
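One way the feedback could be improved is to translate the numeric rsync exit status into a readable hint before it ends up in the job reason. The following is only a minimal shell sketch: the function name `rsync_exit_hint` and the wording of the hints are hypothetical and not part of openQA. The meanings of codes 10 and 23 are taken from the EXIT VALUES section of the rsync man page.

```shell
# Hypothetical helper, not part of openQA: map an rsync exit status to a hint.
rsync_exit_hint() {
    case "$1" in
        0)  echo "Success" ;;
        10) echo "Error in socket I/O (possibly a temporary network issue, retriggering the job may help)" ;;
        23) echo "Partial transfer due to error" ;;
        # Anything else: point the user to the man page instead of guessing
        *)  echo "rsync failed with exit code $1 (see 'man rsync', section EXIT VALUES)" ;;
    esac
}

# Example: print the hint for the exit code seen in this ticket
rsync_exit_hint 10
```

A hint like this could be appended to the "Failed to rsync tests" reason so that users see a likely cause and the workaround without having to look up the exit code.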

Workaround

Retriggering the job should work.


Related issues

Related to openQA Infrastructure - action #46658: openqaworker4 caching fails (Resolved, 2019-01-25)

Related to openQA Project - action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry (Resolved, 2020-10-15)

History

#1 Updated by okurz 6 months ago

#2 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to Low

I looked for recent occurrences:

$ ssh o3 "sudo -u geekotest psql openqa -c \"select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-01-01' and jobs.id = comments.job_id and comments.text ~ 'poo#69553' order by t_finished desc limit 10;\""
   id    |     t_finished      | state |   result   | test |                       reason                       |      host      
---------+---------------------+-------+------------+------+----------------------------------------------------+----------------
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqaworker7
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | oss-apollo7004
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | NONE
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | openqa-aarch64
 1352135 | 2020-08-03 19:34:49 | done  | incomplete | jeos | setup failure: Failed to rsync tests: exit code 10 | power8
(10 rows)

So this did not happen again after 2020-08-03. However, I think we can still improve at least the user feedback.

#3 Updated by okurz 3 months ago

  • Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry to job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback

#4 Updated by okurz 3 months ago

  • Tags changed from caching, worker to caching, worker, cache, rsync, minion, ux, incomplete

#5 Updated by okurz 3 months ago

  • Description updated (diff)

#6 Updated by okurz 3 months ago

  • Related to action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry added

#7 Updated by okurz 3 months ago

  • Description updated (diff)

#8 Updated by kraih 3 months ago

Yes, this does look very harmless, probably a temporary network issue. Retriggering the job should be fine.

#9 Updated by kraih 3 months ago

  • Assignee set to kraih

#10 Updated by okurz 3 months ago

Seems like this can easily be solved together with #73396, and you already saw the other, related pull request.

#11 Updated by okurz 3 months ago

  • Parent task set to #62420

#12 Updated by kraih 2 months ago

  • Status changed from Workable to In Progress

#14 Updated by okurz 2 months ago

  • Status changed from In Progress to Feedback

Merged. I suggest we wait until this is deployed to o3 and osd, then wait for some "auto-review" runs, check whether this ticket is still used for auto-labelling and, if not, remove the auto_review: keyword from the ticket subject line and set the ticket to "Resolved". Alternatively, as a shortcut, after deployment we can look for a corresponding log message in the log files and then do the other steps.

#15 Updated by okurz about 2 months ago

  • Subject changed from job incompletes with auto_review:"Failed to rsync tests: exit code 10":retry, improve user feedback to job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback
  • Status changed from Feedback to Resolved

Since kraih is likely working on something different, I checked just now with

env host="o3 osd" failed_since="NOW() - interval '90 day'" sh -ex $(which openqa-query-for-job-label) 69553

and could only find a single job from 2020-09-13 on osd that is still listed, so this seems to work fine with the internal retrying.
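For illustration only, the general idea of retrying on a transient rsync failure can be sketched in shell as below. This is not the actual openQA cache service implementation. The function name `retry_on_transient`, the `RETRY_DELAY` variable and the choice to treat exit codes 10 and 23 as transient are all assumptions made for this sketch.

```shell
# Hypothetical sketch, not the openQA cache service implementation:
# rerun a command while it exits with a status that is assumed to be transient.
retry_on_transient() {
    max_attempts=$1; shift
    attempt=1
    while :; do
        "$@"
        status=$?
        case "$status" in
            10|23) ;;                 # assumed transient: socket I/O error, partial transfer
            *) return "$status" ;;    # success or non-transient failure: stop immediately
        esac
        # Give up once the allowed number of attempts is used up
        [ "$attempt" -ge "$max_attempts" ] && return "$status"
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"     # RETRY_DELAY is a hypothetical knob, default 5 seconds
    done
}
```

With this sketch, the rsync call from the log above could be wrapped as `retry_on_transient 3 rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/`, so that a job only becomes incomplete if all attempts fail.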
