action #70768
obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently
Description
Observation
The obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently on OSD (not o3).
The failing obs_rsync_run jobs can be observed using the following query parameters: https://openqa.suse.de/minion/jobs?soffset=0&task=obs_rsync_run&state=failed
I'm going to remove most of these jobs to calm down the alert, but right now 43 jobs have piled up over 22 days. However, the problem has actually existed for longer than 22 days. I assume somebody cleaned up the Minion dashboard at some point.
The job arguments are always like this:
"args" => [
{
"project" => "SUSE:SLE-15-SP3:GA:Staging:E"
}
],
The results always look like one of these:
"result" => {
"code" => 256,
"message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/S/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]"
},
"result" => {
"code" => 256,
"message" => "No file found: {SUSE:SLE-15-SP3:GA:Staging:B/read_files.sh}"
},
So there are two different cases in which something cannot be found.
There are also some failing obs_rsync_update_builds_text jobs which look like this:
"result" => {
"code" => 256,
"message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/E/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]\n"
},
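The two "not found" cases above can be told apart mechanically from the message field. A minimal sketch in Python (not openQA's own Perl code); the function name and the category labels are hypothetical, and the patterns are derived only from the example messages quoted in this ticket:

```python
import re
from typing import Optional, Tuple

# Patterns derived only from the example messages quoted in this ticket.
MISSING_DIR = re.compile(
    r'rsync: change_dir "(?P<path>[^"]+)" \(in repos\) failed: '
    r"No such file or directory"
)
MISSING_FILE = re.compile(r"No file found: \{(?P<path>[^}]+)\}")


def classify_failure(message: str) -> Tuple[str, Optional[str]]:
    """Return a (hypothetical) category label and the path that was not found."""
    match = MISSING_DIR.search(message)
    if match:
        return ("missing_directory", match.group("path"))
    match = MISSING_FILE.search(message)
    if match:
        return ("missing_file", match.group("path"))
    # Anything else, including an empty message, is left unclassified.
    return ("other", None)
```

Grouping failed jobs by such a label would show quickly whether anything other than missing source directories or files is involved.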
Suggestions
It looks like these failures are not a practical problem; at least we haven't received any negative feedback. Likely it works again on the next run or the job was not needed anymore anyway. If that's true, these jobs should not end up as failures¹ or shouldn't have been created in the first place. Maybe there's also an actual bug we need to fix.
¹ Failures in the sense of the Minion dashboard are jobs which should be manually investigated but I doubt these failing jobs should be manually investigated when they occur.
- Check low-level commands executed by obsrsync, potentially try to reproduce manually
- Check if source project folders exist or not
- Just adjust our monitoring to ignore obs_rsync failures. If test reviewers find missing assets in their tests, these tests will end up incomplete and https://openqa.suse.de/minion/jobs?state=failed has additional information available on demand.
Updated by okurz over 4 years ago
- Tags set to obs_rsync, alert, minion
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Description updated (diff)
- Status changed from New to Workable
- Target version set to Ready
mkittler wrote: "Maybe there's also an actual bug we need to fix." but so far it looks like a problem regarding actual project directories not existing on IBS so I think this fits better in "openQA Infrastructure" until we know if there is an actual issue to fix in our code
Updated by mkittler over 4 years ago
- Subject changed from obs_rsync_run Minion tasks fail frequently to obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently
- Description updated (diff)
Updated by okurz over 4 years ago
deleted all but the oldest failed minion jobs from https://openqa.suse.de/minion/jobs?state=failed
Updated by okurz over 4 years ago
- Related to action #70975: [alert] too many failed minion jobs added
Updated by okurz about 4 years ago
- Priority changed from Normal to High
I have to repeatedly delete the corresponding minion jobs. This happened again as reported in #77089, so I am increasing the prio. At the very least, if we do not have a better idea, we need to adjust monitoring to not alert us about this.
Updated by Xiaojing_liu about 4 years ago
Some obs_rsync_run minion jobs failed. Some results look like this:
https://openqa.suse.de/minion/jobs?id=933724
"result" => {
"code" => 512,
"message" => "SUSE:SLE-15-SP3:GA:TEST/base/ exit code: 1 (1 failures total so far)\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (2 failures total so far)"
},
And https://openqa.suse.de/minion/jobs?id=933542 shows:
"result" => {
"code" => 256,
"message" => "No changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/base/}\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (1 failures total so far)\nNo changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/migration/}"
},
obs_rsync_update_builds_text failed like this:
https://openqa.suse.de/minion/jobs?id=932921
"result" => {
"code" => 256,
"message" => ""
},
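The code values 256 and 512 look like raw wait statuses rather than plain exit codes. Assuming that is the case (the ticket does not confirm it), the exit code is the status shifted right by 8 bits, which gives 1 and 2 and matches the "failures total so far" counts in the messages above. A small sketch in Python with a hypothetical helper name:

```python
def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a 16-bit POSIX-style wait status.

    Assumption: the "code" field of the Minion job result is a raw wait
    status (high byte = exit code, low byte = signal information).
    """
    return (status >> 8) & 0xFF


for status in (256, 512):
    print(status, "->", exit_code_from_wait_status(status))
```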
Updated by okurz about 4 years ago
grafana does not know which jobs are failing, only how many. The data comes from the route in openQA /admin/influxdb/minion
e.g.
https://openqa.suse.de/admin/influxdb/minion
with example:
openqa_minion_jobs,url=https://openqa.suse.de active=0i,delayed=0i,failed=5i,inactive=0i
openqa_minion_workers,url=https://openqa.suse.de active=0i,inactive=1i
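The route returns InfluxDB line protocol, so the failed count can be read out with a few lines of parsing. A rough sketch in Python that only handles the simple shape of the example output (no timestamps, no escaped commas or spaces, integer fields with an "i" suffix); the function name is hypothetical:

```python
def parse_minion_metrics(text: str) -> dict:
    """Parse the simple line-protocol output into {measurement: {field: int}}.

    Assumption: entries are "measurement,tags fields" pairs separated by
    whitespace, without timestamps or escaped characters, and every field
    is an integer with an "i" suffix, as in the example output.
    """
    tokens = text.split()
    metrics = {}
    # Tokens alternate between "measurement,tags" and "field=value,..." parts.
    for head, fields in zip(tokens[0::2], tokens[1::2]):
        measurement = head.split(",", 1)[0]
        values = {}
        for pair in fields.split(","):
            key, value = pair.split("=", 1)
            values[key] = int(value.rstrip("i"))
        metrics[measurement] = values
    return metrics


example = (
    "openqa_minion_jobs,url=https://openqa.suse.de "
    "active=0i,delayed=0i,failed=5i,inactive=0i\n"
    "openqa_minion_workers,url=https://openqa.suse.de "
    "active=0i,inactive=1i"
)
print(parse_minion_metrics(example)["openqa_minion_jobs"]["failed"])
```

The output carries only totals per state, which is why the filtering has to happen on the openQA side before the numbers reach grafana.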
The best thing I can think of right now is to change the openQA code to allow filtering which minion jobs are included in the count, for example based on a config variable with a regex string.
one can filter out jobs based on this regex string, for example
$self->app->config->{global}->{influxdb_minion_job_blocklist}
and then the only thing to do for OSD would be to add this parameter in our config, for example after the line
https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L37
add
influxdb_minion_job_blocklist: .*obs_rsync.*
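To illustrate the idea, here is a sketch in Python rather than openQA's Perl. It counts failed jobs and skips tasks whose name matches the blocklist regex. The function name and the sample data are made up for illustration; only the setting name and the regex come from the proposal above. Whether the regex would be matched anywhere in the task name or anchored is an assumption here.

```python
import re


def count_failed(failed_tasks, blocklist_regex=None):
    """Count failed jobs, skipping tasks that match the blocklist regex.

    failed_tasks: list of task names, one entry per failed job.
    blocklist_regex: value of the proposed influxdb_minion_job_blocklist
    setting, or None/empty to count everything.
    """
    if not blocklist_regex:
        return len(failed_tasks)
    pattern = re.compile(blocklist_regex)
    return sum(1 for task in failed_tasks if not pattern.search(task))


# Made-up sample data, not taken from OSD.
failed = ["obs_rsync_run"] * 4 + ["obs_rsync_update_builds_text", "some_other_task"]
print(count_failed(failed))
print(count_failed(failed, r".*obs_rsync.*"))
```

With no blocklist configured the count stays as it is today, so instances that do not set the parameter would see no change.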
Updated by mkittler about 4 years ago
Adding generic filtering here seems reasonable. However, in practice it is not completely straightforward to implement. So far we're using the Minion framework to provide us with the statistics. Annoyingly, this statistics function is quite limited, so we cannot easily filter here.
I see multiple ways to work around the limitation:
- We could use the jobs function instead. However, from the documentation it isn't clear how negative conditions would work and maybe it is not even possible. One way to work around this would be to query for the tasks we want to filter out and subtract the number of returned jobs from the statistics we've got so far. Within this ticket we only care about failed jobs, so one such query for the failed jobs would be sufficient. If the filtering should work regardless of the job state, we would need to invoke the jobs function 4 times (once for each state).
- We could also try to access the database tables Minion uses under the hood directly. Using SQL directly we would likely be able to apply the filtering more efficiently and it would likely still be a single query. In order to use the database directly, one would use
$app->minion->backend->pg
which will return a Mojo::Pg object.
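Both options can be compared with a small stand-in. The sketch below uses Python with an in-memory SQLite database instead of Minion's PostgreSQL backend; the table layout, the sample rows and all function names are simplified assumptions for illustration. SQLite needs a user-defined REGEXP function, whereas PostgreSQL has built-in regex operators such as `!~`.

```python
import re
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the job table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    conn.create_function(
        "REGEXP", 2, lambda pattern, value: 1 if re.search(pattern, value) else 0
    )
    conn.execute("CREATE TABLE jobs (task TEXT NOT NULL, state TEXT NOT NULL)")
    rows = (
        [("obs_rsync_run", "failed")] * 4
        + [("obs_rsync_update_builds_text", "failed")]
        + [("other_task", "failed")] * 5
        + [("obs_rsync_run", "finished")]
    )
    conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
    return conn


def failed_by_subtraction(conn, blocked_tasks):
    """Option 1: take the total and subtract one count per blocked task."""
    total = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE state = 'failed'"
    ).fetchone()[0]
    blocked = 0
    for task in blocked_tasks:
        blocked += conn.execute(
            "SELECT COUNT(*) FROM jobs WHERE state = 'failed' AND task = ?",
            (task,),
        ).fetchone()[0]
    return total - blocked


def failed_by_single_query(conn, blocklist_regex):
    """Option 2: apply the regex filter directly in a single query."""
    return conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE state = 'failed' AND NOT (task REGEXP ?)",
        (blocklist_regex,),
    ).fetchone()[0]


conn = make_demo_db()
print(failed_by_subtraction(conn, ["obs_rsync_run", "obs_rsync_update_builds_text"]))
print(failed_by_single_query(conn, r".*obs_rsync.*"))
```

The subtraction variant needs the blocked task names spelled out, while the single query works with the regex as configured, which is one argument for going to the database directly.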
Updated by Xiaojing_liu about 4 years ago
Updated by livdywan almost 4 years ago
- Status changed from Workable to In Progress
I suspect this should be In Progress since there's a PR proposing the fix
Updated by Xiaojing_liu almost 4 years ago
PR in openQA has been merged.
Here is the related MR for OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/427
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Feedback
Updated by Xiaojing_liu almost 4 years ago
- Status changed from Feedback to Resolved
- Estimated time set to 16.00 h
On OSD, https://openqa.suse.de/minion/jobs?state=failed shows 10 failed jobs, including 4 'obs_rsync_run' and 1 'obs_rsync_update_builds_text'. Querying https://openqa.suse.de/admin/influxdb/minion then shows:
openqa_minion_jobs,url=https://openqa.suse.de active=4i,delayed=0i,failed=5i,inactive=36i
openqa_minion_workers,url=https://openqa.suse.de active=1i,inactive=0i
That is only 5 failed jobs, so I consider this ticket resolved.
Updated by mkittler over 3 years ago
- Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Updated by livdywan over 2 years ago
- Copied to action #112871: obs_rsync_run Minion tasks fail with no error message size:M added