action #112871
closed
obs_rsync_run Minion tasks fail with no error message size:M
Added by livdywan about 2 years ago. Updated 24 days ago.
0%
Description
Observation
Periodically obs_rsync_run Minion tasks fail like so:
---
args:
- project: SUSE:SLE-15-SP3:Update:BCI
attempts: 1
children: []
created: 2022-06-20T13:16:56.947614Z
delayed: 2022-06-20T13:35:47.741627Z
expires: ~
finished: 2022-06-20T13:35:48.639216Z
id: 4720491
lax: 0
notes:
  gru_id: 31834209
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: No message
retried: 2022-06-20T13:34:47.741627Z
retries: 18
started: 2022-06-20T13:35:47.983500Z
state: failed
task: obs_rsync_run
time: 2022-06-22T10:24:43.930467Z
worker: 767
There is no error message and I can't guess what might have caused this. There seems to be a code path that consumes errors without propagating them.
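A side note on `code: 256`: assuming this field holds the raw wait status of the child process, the way Perl's `$?` reports it (an assumption about the plugin, not confirmed in this ticket), 256 would simply mean the script ended with `exit 1`. A minimal sketch of that decoding, with `decode_wait_status` as a hypothetical helper:

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into (exit_code, signal)."""
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the process
    signal = status & 0x7F            # low 7 bits: terminating signal, 0 if none
    return exit_code, signal


print(decode_wait_status(256))  # (1, 0): exit code 1, no signal
```

If that reading is right, the job tells us only that something returned a non-zero status, which matches the missing message.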
Acceptance criteria
- AC1: The jobs do not fail with unknown errors anymore
Suggestions
- Research the plugin and try to understand what the code is meant to achieve
- Maybe just make the job not fail, if there is no real error and this is an expected condition (Minion jobs are only supposed to fail for real errors that need human intervention)
- https://github.com/os-autoinst/openqa-trigger-from-obs
- Hypothesis: the project in question was not configured, so there are no files
- https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/V/images/iso/ is empty
- Add a list of files to the "notes" of the Minion job
- Add an IBS/OBS/GitLab URL to the "notes"
- Add stderr to the "notes"
Updated by livdywan about 2 years ago
- Copied from action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently added
Updated by okurz about 2 years ago
- Project changed from openQA Infrastructure to QA
- Target version set to future
- Start date deleted (2020-09-01)
- Estimated time deleted (16.00 h)
Updated by tinita 10 months ago · Edited
- Target version changed from future to Ready
We discussed in the unblock that it would be good to get rid of the failures because they distract from potential real errors.
Maybe it can just be turned into a log message.
@mkittler as a stand-in product owner agreed to put this in Ready ;-)
Updated by tinita 10 months ago
- Related to action #139073: ObsRsync plugin needs to support authentication with 2FA size:M added
Updated by livdywan 9 months ago
- Status changed from Workable to In Progress
Hypothesis: the project in question was not configured, so there are no files
https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/V/images/iso/ is empty
Add a list of files to the "notes" of the Minion job
Add an IBS/OBS/GitLab URL to the "notes"
Add stderr to the "notes"
Added these points after discussing this in the unblock. We don't need to fix the project config here, but make sure the minion job has links to where it syncs from, and shows if/what files were synched.
Updated by osukup 9 months ago
So my findings so far are:
deployment
1) git repos on GitHub and GitLab with the obs_rsync scriptgen.py (+ some other scripts) + XML definitions of products (repo-projects in IBS/OBS)
2) deployed on OSD by salt
3) base scripts and project folders are generated by salt + salt calling scriptgen.py with the project name (macro in salt)
run
1) config in openSUSE-Release-Tools repo
2) started by botmaster.suse.de if there are any changes in related projects (projects hardcoded in the gocd yaml config)
3) over an OSD/O3 API call to run ObsRsync
4) the plugin in OSD/O3 generates files_iso.rst and runs rsync.sh
5) rsync.sh copies files to subfolders, prepares logs and runs the scripts generated by scriptgen ... (this is the script started by minion)
6) the first part of the script generates lst files with repositories and images
7) the second part of the script generates the final scripts which call rsync from dist.suse.de to openQA assets
8) also in the second part, it generates the final bash script which contains openqa-cli ... with proper iso/post
9) all done: if everything passes, we have a new build/builds running in the instance
The problem starts when the XML config for a project isn't equivalent to the IBS project and files_iso.lst ends up empty ...
--> so the solution should be simple --> add a test for whether files_iso.lst is empty, and if it is --> send an error message to stderr and exit the script with status 1 --> no need to do anything else
- I asked about failing Dolomite jobs on #eng-testing and Jose Lausuch replied with "Ok, Staging... Maybe something changed in IBS... I will have a look later.."
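The empty-list check proposed above could look roughly like this. The real scripts are bash; Python is used here only for illustration, and `check_file_list` is a hypothetical helper that does not exist in openqa-trigger-from-obs:

```python
import tempfile
from pathlib import Path


def check_file_list(path):
    """Return (exit_code, message) for a generated list file.

    A missing file and a file with only whitespace are both treated as empty.
    """
    p = Path(path)
    if not p.is_file() or not p.read_text().strip():
        return 1, f"{p.name} is empty or missing, nothing to sync"
    return 0, ""


# Demonstration with a throwaway, empty files_iso.lst
with tempfile.TemporaryDirectory() as tmp:
    lst = Path(tmp) / "files_iso.lst"
    lst.write_text("")
    code, message = check_file_list(lst)
    print(code, message)
```

The caller would print the message to stderr and exit with the returned code, so the reason for the failure ends up next to the exit status.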
Updated by osukup 9 months ago
obs_rsync_update_builds_text can also fail without any message
result:
  code: 256
  message: ''
retried: ~
retries: 0
started: 2023-12-13T13:42:18.097617Z
state: failed
task: obs_rsync_update_builds_text
time: 2023-12-13T13:42:49.425004Z
worker: 1545
- I tried adding a skip + message to rsync.sh, but without any result; it looks like all messages in this script are silently thrown away
Updated by osukup 9 months ago
Tried to add an echo and an explicit exit 1 to read_files.sh ... without any visible change
Q: why does this eat everything and throw away any error message?
--> https://github.com/os-autoinst/openQA/blob/97f9359a11a8174e4538f536cfb8ea82c482d3e1/lib/OpenQA/WebAPI/Plugin/ObsRsync/Task.pm#L57-L68 and https://github.com/os-autoinst/openQA/blob/97f9359a11a8174e4538f536cfb8ea82c482d3e1/lib/OpenQA/WebAPI/Plugin/ObsRsync/Task.pm#L57-L68
Updated by osukup 9 months ago
ah, I'm blind: bash -e
https://github.com/os-autoinst/openqa-trigger-from-obs/blob/master/script/rsync.sh#L39C5-L39C39 ..
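One plausible reading of this note: when a script runs under `-e`, the shell stops at the first failing command, so an `echo` and `exit 1` added after that command never run. The sketch below reproduces only that general shell behaviour with a made-up three-line script; it does not run the real rsync.sh or read_files.sh, and it falls back to `sh` if `bash` is not installed:

```python
import shutil
import subprocess

# Use bash if available; -e behaves the same way in a POSIX sh.
shell = shutil.which("bash") or "sh"

# Made-up stand-in script: a failing command, then an error message.
script = 'false\necho "read_files.sh failed" >&2\nexit 1\n'


def run(flags):
    """Run the stand-in script and return (exit_code, stderr)."""
    proc = subprocess.run([shell, *flags, "-c", script],
                          capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()


print(run(["-e"]))  # (1, ''): aborted at `false`, the message is never printed
print(run([]))      # (1, 'read_files.sh failed')
```

Both runs exit with status 1, but only the one without `-e` leaves anything on stderr.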
Updated by openqa_review 9 months ago
- Due date set to 2023-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 9 months ago
- Status changed from In Progress to Feedback
livdywan wrote in #note-15:
https://github.com/os-autoinst/openqa-trigger-from-obs/pull/237 under review
Looks like it worked? For example https://openqa.suse.de/minion/jobs?id=9781575
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:B
attempts: 1
children: []
created: 2023-12-18T10:50:54.219849Z
delayed: 2023-12-18T10:50:54.219849Z
expires: ~
finished: 2023-12-18T10:50:54.845646Z
id: 9781575
lax: 0
notes:
  gru_id: 35998859
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/B/images/repo" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
    read_files.sh failed for in SUSE:ALP:Source:Standard:1.0:Staging:B
retried: ~
retries: 0
started: 2023-12-18T10:50:54.221439Z
state: failed
task: obs_rsync_run
time: 2023-12-18T14:59:57.044997Z
worker: 1550
Updated by okurz 9 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1069 merged. I guess you can check for any upcoming minion jobs over the next day or two and then resolve.
Updated by okurz 9 months ago
- Due date changed from 2023-12-28 to 2024-01-12
- Status changed from Feedback to Workable
As discussed in the weekly unblock 2023-12-20, we decided that we want the "obs_rsync_run" minion jobs to not fail on empty project directories and to handle that case gracefully. So the next step is to make the minion jobs pass, but with an appropriate log entry like "empty project encountered, nothing to do.". You are welcome to implement that change as part of this ticket if it's simple and feasible, or to create a separate ticket with the relevant context and resolve this one.
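The decision above could be sketched as follows. This is only an illustration of the intended behaviour, not the plugin's code: `summarize_run`, its parameters and the shape of the returned dict are all hypothetical.

```python
def summarize_run(exit_code, stderr, project_empty):
    """Map the outcome of a sync run to a job state and message.

    An empty project is an expected condition: the job passes and the
    situation is only logged. Any other non-zero exit is a real failure.
    """
    if project_empty:
        return {"state": "finished",
                "message": "empty project encountered, nothing to do."}
    if exit_code != 0:
        # Keep stderr so the failure can be investigated later.
        return {"state": "failed", "message": stderr.strip() or "No message"}
    return {"state": "finished", "message": ""}


print(summarize_run(1, "rsync error: code 23", project_empty=True))
print(summarize_run(1, "rsync error: code 23", project_empty=False))
```

How the script would detect an empty project is left open here; the point is only that this case no longer ends in a failed job.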
Updated by tinita 9 months ago
Is it known that there are also still failed jobs without an error message? For example https://openqa.suse.de/minion/jobs?id=9788472
result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:S in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:S
So we know here that read_files.sh failed, but still not why.
Just wanted to mention it.
Updated by livdywan 9 months ago
tinita wrote in #note-21:
Is it known that there are also still failed jobs without an error message? like https://openqa.suse.de/minion/jobs?id=9788472
result: code: 256 message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:S in enviroment SUSE:ALP:Source:Standard:1.0:Staging:S
So we know here that read_files.sh failed, but still not why.
Just wanted to mention it.
Hrm. I'm not 100% sure now if this might be a different issue. I guess we'll see soon if it still occurs as my MR has been merged.
Updated by livdywan 9 months ago
- Status changed from Workable to In Progress
- Assignee changed from osukup to livdywan
So we're still seeing this, e.g. https://openqa.suse.de/minion/jobs?id=9826853, so I'll try to confirm what causes it here.
Updated by livdywan 9 months ago
okurz wrote in #note-20:
You are welcome to implement that change as part of this ticket if it's simple and feasible, or to create a separate ticket with the relevant context and resolve this one.
https://github.com/os-autoinst/openqa-trigger-from-obs/pull/238 This is to make it okay to fail. We'll still log the errors in case they are relevant for a later investigation.
Updated by livdywan 9 months ago
Brief update, as I've not had a chance to work on this due to hols and catching up. I think there are two steps left here:
- Add a IBS/OBS/GitLab URL to the "notes"
The change to add the URL was merged. Unfortunately I didn't realize it is not expanding correctly until after it got merged.
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:C
  url: https://api.suse.de/build/%%PROJECT/_result
https://github.com/os-autoinst/openqa-trigger-from-obs/pull/238 This is to make it okay to fail. We'll still log the errors in case they are relevant for a later investigation.
I'll propose a change to script/scriptgen.py / gen_read_files to make it more specific.
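For reference, this is the expansion the `url` note is presumably meant to get. The sketch assumes `%%PROJECT` is a plain placeholder to be replaced by the project name; `expand_url` is a hypothetical helper and says nothing about how scriptgen.py or salt actually perform the substitution.

```python
def expand_url(template, project):
    """Replace the %%PROJECT placeholder with the project name."""
    if "%%PROJECT" not in template:
        raise ValueError("template has no %%PROJECT placeholder")
    return template.replace("%%PROJECT", project)


print(expand_url("https://api.suse.de/build/%%PROJECT/_result",
                 "SUSE:ALP:Source:Standard:1.0:Staging:C"))
# https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:C/_result
```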
Updated by livdywan 8 months ago
Drafts coming up, still needs a little clean-up https://github.com/os-autoinst/openQA/pull/5420
Updated by livdywan 8 months ago
- Copied to action #153421: [spike][timeboxed:10h] Replace scriptgen with executing rsync from python added
Updated by livdywan 8 months ago
- Due date changed from 2024-01-12 to 2024-01-19
Due to productive but unrelated conversations, and since we should verify this in production over the coming days I'm bumping the due date. That pair-programming session should have happened a couple days earlier in the week for the timing to work out.
Updated by livdywan 8 months ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/5424 Draft for additional unit test coverage
Updated by livdywan 8 months ago
okurz wrote in #note-19:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1069 merged. I guess you can check for any upcoming minion jobs over the next day or two and then resolve.
Reverted by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1097#note_583220 which is in line with recent discussions on expected behavior
Updated by nicksinger 3 months ago
- Related to action #163340: OBSRSync regularily fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") size:M added