action #112871

closed

obs_rsync_run Minion tasks fail with no error message size:M

Added by livdywan over 2 years ago. Updated 3 months ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

Periodically obs_rsync_run Minion tasks fail like so:

---
args:
- project: SUSE:SLE-15-SP3:Update:BCI
attempts: 1
children: []
created: 2022-06-20T13:16:56.947614Z
delayed: 2022-06-20T13:35:47.741627Z
expires: ~
finished: 2022-06-20T13:35:48.639216Z
id: 4720491
lax: 0
notes:
  gru_id: 31834209
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: No message
retried: 2022-06-20T13:34:47.741627Z
retries: 18
started: 2022-06-20T13:35:47.983500Z
state: failed
task: obs_rsync_run
time: 2022-06-22T10:24:43.930467Z
worker: 767

There is no error message and I can't guess what might have caused this. There seems to be a code path that consumes errors without propagating them.
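
A wrapper that runs a script and records only its exit status would produce exactly this kind of result. The following is a minimal sketch in Python, not the plugin's actual code (which is not shown in this ticket). `run_step` is a hypothetical helper that keeps stderr together with the exit code so that a failure always carries some text:

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and return (exit_code, message) without dropping stderr.

    Hypothetical helper for illustration only; not part of the ObsRsync plugin.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return 0, ""
    # Prefer stderr, fall back to stdout, and only then to a generic text,
    # so the caller never has to report an empty message.
    message = (
        proc.stderr.strip()
        or proc.stdout.strip()
        or f"exit code {proc.returncode}, no output"
    )
    return proc.returncode, message
```

The fallback chain is the point of the sketch: even a script that prints nothing still yields a message naming its exit code.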

Acceptance criteria

  • AC1: The jobs do not fail with unknown errors anymore

Suggestions


Related issues 4 (1 open, 3 closed)

Related to openQA Project - action #139073: ObsRsync plugin needs to support authentication with 2FA size:M (Resolved, tinita, 2023-11-03, 2023-12-01)
Related to openQA Infrastructure - action #163340: OBSRSync regularily fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") size:M (Resolved, dheidler)
Copied from openQA Infrastructure - action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently (Resolved, Xiaojing_liu, 2020-09-01)
Copied to QA - action #153421: [spike][timeboxed:10h] Replace scriptgen with executing rsync from python (New)

Actions #1

Updated by livdywan over 2 years ago

  • Copied from action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently added
Actions #2

Updated by okurz over 2 years ago

  • Project changed from openQA Infrastructure to QA
  • Target version set to future
  • Start date deleted (2020-09-01)
  • Estimated time deleted (16.00 h)
Actions #3

Updated by tinita 12 months ago · Edited

  • Target version changed from future to Ready

We discussed in the unblock that it would be good to get rid of the failures because they distract from potential real errors.
Maybe it can just be turned into a log message.
@mkittler as a stand-in product owner agreed to put this in Ready ;-)

Actions #4

Updated by tinita 12 months ago

  • Related to action #139073: ObsRsync plugin needs to support authentication with 2FA size:M added
Actions #5

Updated by tinita 12 months ago

  • Subject changed from obs_rsync_run Minion tasks fail with no error message to obs_rsync_run Minion tasks fail with no error message size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by osukup 12 months ago

  • Assignee set to osukup
Actions #7

Updated by livdywan 11 months ago

  • Description updated (diff)
Actions #8

Updated by livdywan 11 months ago

  • Status changed from Workable to In Progress

  • Hypothesis: the project in question was not configured, so there are no files
  • https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/V/images/iso/ is empty
  • Add a list of files to the "notes" of the Minion job
  • Add an IBS/OBS/GitLab URL to the "notes"
  • Add stderr to the "notes"

Added these points after discussing this in the unblock. We don't need to fix the project config here, but we should make sure the Minion job has links to where it syncs from and shows whether and which files were synced.
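
As an illustration of the three points about "notes", here is a small Python sketch of the context a sync job could record. The function name, the key names and the truncation limit are all assumptions made for this example; the ticket does not specify them:

```python
def build_notes(project, source_url, synced_files, stderr_text, max_stderr_chars=2000):
    """Collect context for a sync job so a failure can be traced afterwards.

    All key names are hypothetical; they are not taken from the plugin.
    """
    if max_stderr_chars > 0:
        # Keep the tail of stderr: the last lines usually hold the actual error.
        stderr_tail = stderr_text[-max_stderr_chars:]
    else:
        stderr_tail = ""
    return {
        "project": project,
        "source_url": source_url,
        "synced_files": sorted(synced_files),
        "synced_count": len(synced_files),
        "stderr": stderr_tail,
    }
```

Recording `synced_count` next to the list makes an empty sync visible at a glance, which is the case discussed in this ticket.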

Actions #9

Updated by osukup 11 months ago

So my findings as of now are:

  • deployment
    1) git repos on GitHub and GitLab with the obs_rsync scriptgen.py (plus some other scripts) and XML definitions of products (repo projects in IBS/OBS)
    2) deployed on OSD by salt
    3) base scripts and project folders are generated by salt, with salt calling scriptgen.py with the project name (a macro in salt)

  • run
    1) config in the openSUSE-Release-Tools repo
    2) started by botmaster.suse.de if there are any changes in related projects (the projects are hardcoded in the gocd yaml config)
    3) via an OSD/O3 API call to run OBSRsync
    4) the plugin in OSD/O3 generates files_iso.rst and runs rsync.sh
    5) rsync.sh copies files to subfolders, prepares logs and runs the scripts generated by scriptgen ... (this is the script started by Minion)
    6) the first part of the script generates lst files with repositories and images
    7) the second part of the script generates the final scripts, which call rsync from dist.suse.de to the openQA assets
    8) also in the second part, it generates the final bash script which contains openqa-cli .... with the proper iso/post
    9) all done: if everything passes, we have a new build or builds running in the instance

The problem starts when the XML config for a project isn't equivalent to the IBS project and files_iso.lst ends up empty ...

--> so the solution should be simple --> add a test for whether files_iso.lst is empty and, if it is, send an error message to stderr and exit the script with 1 --> no need to do anything else
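
The proposed check can be sketched as follows. This is a Python illustration of the idea only. The real scripts are shell scripts generated by scriptgen, and the function names here are hypothetical:

```python
import os
import sys


def has_entries(list_path):
    """Return True if the file exists and has at least one non-blank line."""
    if not os.path.isfile(list_path):
        return False
    with open(list_path, encoding="utf-8") as handle:
        return any(line.strip() for line in handle)


def check_file_list(list_path):
    """Return the exit code the script should use: 1 with a message on stderr
    when the list is empty or missing, 0 otherwise."""
    if not has_entries(list_path):
        print(f"{list_path} is empty or missing, nothing to sync", file=sys.stderr)
        return 1
    return 0
```

A caller would pass the result to `sys.exit(check_file_list("files_iso.lst"))`, so the message reaches stderr before the script stops.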

  • I asked about failing Dolomite jobs on #eng-testing and Jose Lausuch replied with "Ok, Staging... Maybe something vhangrd in IBS... I will have a look later.."
Actions #10

Updated by osukup 11 months ago

obs_rsync_update_builds_text can also fail without any message

result:
  code: 256
  message: ''
retried: ~
retries: 0
started: 2023-12-13T13:42:18.097617Z
state: failed
task: obs_rsync_update_builds_text
time: 2023-12-13T13:42:49.425004Z
worker: 1545
  • I tried adding a skip + message to rsync.sh but without any result; it looks like all messages in this script are silently thrown away
Actions #11

Updated by osukup 11 months ago

obs_rsync_update_builds_text -> it runs read_files.sh ... which rsync.sh also runs as a subprocess in the obs_rsync_run task

Actions #14

Updated by openqa_review 11 months ago

  • Due date set to 2023-12-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by livdywan 11 months ago

  • Status changed from In Progress to Feedback

livdywan wrote in #note-15:

https://github.com/os-autoinst/openqa-trigger-from-obs/pull/237 under review

Looks like it worked? For example https://openqa.suse.de/minion/jobs?id=9781575

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:B
attempts: 1
children: []
created: 2023-12-18T10:50:54.219849Z
delayed: 2023-12-18T10:50:54.219849Z
expires: ~
finished: 2023-12-18T10:50:54.845646Z
id: 9781575
lax: 0
notes:
  gru_id: 35998859
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/B/images/repo" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
    read_files.sh failed for  in SUSE:ALP:Source:Standard:1.0:Staging:B
retried: ~
retries: 0
started: 2023-12-18T10:50:54.221439Z
state: failed
task: obs_rsync_run
time: 2023-12-18T14:59:57.044997Z
worker: 1550
Actions #17

Updated by livdywan 11 months ago · Edited

Dealing with the actual failures is still being discussed in Slack, the solution possibly being to drop builds which are never published from salt.

Actions #19

Updated by okurz 11 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1069 merged. I guess you can check for any upcoming minion jobs over the next day or two and then resolve.

Actions #20

Updated by okurz 11 months ago

  • Due date changed from 2023-12-28 to 2024-01-12
  • Status changed from Feedback to Workable

As discussed in the weekly unblock on 2023-12-20, we decided that we want the "obs_rsync_run" minion jobs to not fail on empty project directories and to handle that gracefully. So the next step is to make the minion jobs pass, but with a corresponding log entry like "empty project encountered, nothing to do.". You are welcome to implement that change as part of this ticket if it's simple and feasible, or create a separate ticket with the relevant context and resolve this one.
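
A minimal Python sketch of that decision, assuming the caller already knows which files were found. `finish_state` and its return values are hypothetical and only mirror the "finished"/"failed" states seen in the job dumps in this ticket:

```python
import logging

log = logging.getLogger("obs_rsync_sketch")

EMPTY_MESSAGE = "empty project encountered, nothing to do."


def finish_state(files, exit_code, stderr_text=""):
    """Map a sync outcome to a (state, message) pair.

    An empty file list is treated as a normal outcome and only logged;
    any other non-zero exit code is still reported as a failure.
    """
    if not files:
        log.info("%s", EMPTY_MESSAGE)
        return "finished", EMPTY_MESSAGE
    if exit_code != 0:
        return "failed", stderr_text.strip() or f"exit code {exit_code}"
    return "finished", f"synced {len(files)} file(s)"
```

Note that this simple variant would also hide a real error that happens to leave the file list empty. A stricter version would pass only when the source directory is confirmed to be empty or missing.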

Actions #21

Updated by tinita 11 months ago

Is it known that there are also still failed jobs without an error message? like https://openqa.suse.de/minion/jobs?id=9788472

result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:S in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:S

So we know here that read_files.sh failed, but still not why.

Just wanted to mention it.

Actions #22

Updated by livdywan 11 months ago

tinita wrote in #note-21:

Is it known that there are also still failed jobs without an error message? like https://openqa.suse.de/minion/jobs?id=9788472

result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:S in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:S

So we know here that read_files.sh failed, but still not why.

Just wanted to mention it.

Hrm. I'm not 100% sure now if this might be a different issue. I guess we'll see soon if it still occurs as my MR has been merged.

Actions #23

Updated by livdywan 11 months ago

  • Status changed from Workable to In Progress
  • Assignee changed from osukup to livdywan

So we're still seeing this e.g. https://openqa.suse.de/minion/jobs?id=9826853 meaning I'll try to confirm what would cause it here.

Actions #24

Updated by livdywan 11 months ago

okurz wrote in #note-20:

You are welcome to implement that change as part of this ticket if it's simple and feasible, or create a separate ticket with the relevant context and resolve this one.

https://github.com/os-autoinst/openqa-trigger-from-obs/pull/238 This is to make it okay to fail. We'll still log the errors in case they are relevant for a later investigation.

Actions #25

Updated by livdywan 11 months ago

livdywan wrote:

  • Add an IBS/OBS/GitLab URL to the "notes"

https://github.com/os-autoinst/openQA/pull/5409

Actions #26

Updated by livdywan 11 months ago

Brief update as I've not had a chance due to hols and catching up. I think there's two steps left here:

  • Add an IBS/OBS/GitLab URL to the "notes"

https://github.com/os-autoinst/openQA/pull/5409

The change to add the URL was merged. Unfortunately I didn't realize it is not expanding correctly until after it got merged.

args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:C
  url: https://api.suse.de/build/%%PROJECT/_result
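
For illustration, expanding such a template and catching leftovers could look like the Python sketch below. `expand_url` is a hypothetical helper, and leaving the colons in the project name unescaped is an assumption. The plugin's actual expansion logic is not shown in this ticket:

```python
def expand_url(template, project):
    """Replace the %%PROJECT placeholder and refuse to return unexpanded URLs."""
    url = template.replace("%%PROJECT", project)
    if "%%" in url:
        # A leftover "%%" means some placeholder was not substituted.
        raise ValueError(f"unexpanded placeholder left in URL: {url}")
    return url
```

A check like the `ValueError` above, exercised in a unit test, would have flagged the unexpanded `%%PROJECT` before it reached the job arguments.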

https://github.com/os-autoinst/openqa-trigger-from-obs/pull/238 This is to make it okay to fail. We'll still log the errors in case they are relevant for a later investigation.

I'll propose a change to script/scriptgen.py / gen_read_files to make it more specific.

Actions #27

Updated by livdywan 11 months ago

Drafts coming up, still needs a little clean-up https://github.com/os-autoinst/openQA/pull/5420

Actions #28

Updated by livdywan 11 months ago

  • Copied to action #153421: [spike][timeboxed:10h] Replace scriptgen with executing rsync from python added
Actions #29

Updated by livdywan 10 months ago

  • Due date changed from 2024-01-12 to 2024-01-19

Due to productive but unrelated conversations, and since we should verify this in production over the coming days I'm bumping the due date. That pair-programming session should have happened a couple days earlier in the week for the timing to work out.

Actions #30

Updated by livdywan 10 months ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5424 Draft for additional unit test coverage

Actions #31

Updated by okurz 10 months ago

  • Status changed from Feedback to Workable
Actions #32

Updated by okurz 10 months ago

  • Due date deleted (2024-01-19)
  • Priority changed from Normal to Low

making this a "background idle" task for our Scrum Master :)

Actions #33

Updated by livdywan 10 months ago

okurz wrote in #note-19:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1069 merged. I guess you can check for any upcoming minion jobs over the next day or two and then resolve.

Reverted by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1097#note_583220 which is in line with recent discussions on expected behavior

Actions #34

Updated by livdywan 5 months ago

  • Assignee deleted (livdywan)

In case someone else is inclined to work on it feel free to grab it. I'm not sure when I'll get to it.

Actions #35

Updated by okurz 5 months ago

  • Target version changed from Ready to Tools - Next
Actions #36

Updated by nicksinger 5 months ago

  • Related to action #163340: OBSRSync regularily fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") size:M added
Actions #37

Updated by livdywan 3 months ago

  • Status changed from Workable to Resolved
  • Assignee set to livdywan

AC1: The jobs do not fail with unknown errors anymore

AC1 was fulfilled. Further improvements can be left to other tickets such as #163340.
