action #70768

obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently

Added by mkittler 11 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-09-01
Due date:
% Done: 0%
Estimated time: 16.00 h

Description

Observation

The obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently on OSD (not o3).

The failing obs_rsync_run jobs can be observed using the following query parameters: https://openqa.suse.de/minion/jobs?soffset=0&task=obs_rsync_run&state=failed

I'm going to remove most of these jobs to calm down the alert but right now 43 jobs have piled up over 22 days. However, the problem has actually existed for longer than 22 days. I assume somebody cleaned up the Minion dashboard at some point.

The job arguments are always like this:

  "args" => [
    {
      "project" => "SUSE:SLE-15-SP3:GA:Staging:E"
    }
  ],

The results always look like one of these:

  "result" => {
    "code" => 256,
    "message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/S/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]"
  },
  "result" => {
    "code" => 256,
    "message" => "No file found: {SUSE:SLE-15-SP3:GA:Staging:B/read_files.sh}"
  },

So there are two different cases in which something cannot be found.
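The two cases can be told apart mechanically. The following is a minimal Python sketch, not part of openQA; the function name `classify_failure` and the category labels are hypothetical, and the patterns are derived only from the two messages quoted above.

```python
import re
from typing import Optional, Tuple

# Patterns derived from the two failure messages quoted in this ticket.
# They are assumptions about the message format, not an official specification.
MISSING_DIR = re.compile(
    r'rsync: change_dir "(?P<path>[^"]+)" \(in repos\) failed: No such file or directory'
)
MISSING_FILE = re.compile(r"No file found: \{(?P<path>[^}]+)\}")


def classify_failure(message: str) -> Tuple[str, Optional[str]]:
    """Return a (category, path) pair for a Minion job result message."""
    match = MISSING_DIR.search(message)
    if match:
        return ("missing_directory", match.group("path"))
    match = MISSING_FILE.search(message)
    if match:
        return ("missing_file", match.group("path"))
    # Anything else (including an empty message) is left for manual review.
    return ("unknown", None)
```

Grouping failed jobs by such a category would make it easier to see whether a project directory is missing on the source side or only the read_files.sh file.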

There are also some failing obs_rsync_update_builds_text jobs which look like this:

  "result" => {
    "code" => 256,
    "message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/E/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]\n"
  },

Suggestions

It looks like these failures are not a practical problem; at least we haven't received any negative feedback. Likely it works again on the next run or the job was not needed anymore anyway. If that's true, these jobs should not end up as failures¹ or shouldn't have been created in the first place. Maybe there's also an actual bug we need to fix.

¹ Failures in the sense of the Minion dashboard are jobs which should be manually investigated, but I doubt these failing jobs should be manually investigated when they occur.

  • Check low-level commands executed by obsrsync, potentially try to reproduce manually
  • Check if source project folders exist or not

Just adjust our monitoring to ignore obs_rsync failures. If test reviewers find missing assets in their tests, these tests will end up incomplete and https://openqa.suse.de/minion/jobs?state=failed has additional information available on demand.


Related issues

Related to openQA Infrastructure - action #70975: [alert] too many failed minion jobs (Resolved, 2020-09-04)

History

#1 Updated by okurz 11 months ago

  • Tags set to obs_rsync, alert, minion
  • Project changed from openQA Project to openQA Infrastructure
  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

mkittler wrote: "Maybe there's also an actual bug we need to fix." but so far it looks like a problem regarding actual project directories not existing on IBS so I think this fits better in "openQA Infrastructure" until we know if there is an actual issue to fix in our code

#2 Updated by mkittler 11 months ago

  • Subject changed from obs_rsync_run Minion tasks fail frequently to obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently
  • Description updated (diff)

#3 Updated by okurz 11 months ago

deleted all but the oldest failed minion jobs from https://openqa.suse.de/minion/jobs?state=failed

#4 Updated by okurz 11 months ago

  • Related to action #70975: [alert] too many failed minion jobs added

#5 Updated by okurz 9 months ago

  • Priority changed from Normal to High

I have to repeatedly delete the corresponding Minion jobs. It happened again as reported in #77089, so I'm increasing the priority. At the very least, if we do not have a better idea, we need to adjust monitoring to not alert us about this.

#6 Updated by Xiaojing_liu 8 months ago

Some obs_rsync_run Minion jobs failed. Some results look like this:
https://openqa.suse.de/minion/jobs?id=933724

  "result" => {
    "code" => 512,
    "message" => "SUSE:SLE-15-SP3:GA:TEST/base/ exit code: 1 (1 failures total so far)\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (2 failures total so far)"
  },

And https://openqa.suse.de/minion/jobs?id=933542 shows:

  "result" => {
    "code" => 256,
    "message" => "No changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/base/}\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (1 failures total so far)\nNo changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/migration/}"
  },

obs_rsync_update_builds_text failed like this:
https://openqa.suse.de/minion/jobs?id=932921

  "result" => {
    "code" => 256,
    "message" => ""
  },

#7 Updated by okurz 7 months ago

  • Description updated (diff)

#8 Updated by okurz 7 months ago

Grafana does not know which jobs are failing, only how many. The data comes from the openQA route /admin/influxdb/minion,
e.g.
https://openqa.suse.de/admin/influxdb/minion

with example:

openqa_minion_jobs,url=https://openqa.suse.de active=0i,delayed=0i,failed=5i,inactive=0i
openqa_minion_workers,url=https://openqa.suse.de active=0i,inactive=1i
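To illustrate what a consumer of this route sees, here is a minimal Python sketch that parses one such record. It assumes a simplified form of the InfluxDB line protocol (no escaped spaces or commas, integer fields carrying an `i` suffix); the function name `parse_influx_line` is hypothetical.

```python
def parse_influx_line(line):
    """Parse one simplified InfluxDB line-protocol record.

    Returns (measurement, tags, fields). Escaping rules of the real
    protocol are deliberately ignored in this sketch.
    """
    head, rest = line.strip().split(" ", 1)
    measurement, *tag_parts = head.split(",")
    tags = dict(part.split("=", 1) for part in tag_parts)
    fields = {}
    # Only the field set is of interest; an optional trailing timestamp is dropped.
    for item in rest.split(" ")[0].split(","):
        key, value = item.split("=", 1)
        # Integer values are written like "5i"; other values are kept as strings.
        fields[key] = int(value[:-1]) if value.endswith("i") else value
    return measurement, tags, fields
```

The record contains only per-state totals, which is why a dashboard reading it cannot tell which tasks are behind the `failed` count.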

The best thing I could think of right now is to change the openQA code to allow filtering which Minion jobs are included in the count, for example based on a config variable with a regex string.

In https://github.com/os-autoinst/openQA/blob/e8470f0ab1b197c8dac5914645350d1412b62a2d/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L107

one can filter out jobs based on this regex string, for example

$self->app->config->{global}->{influxdb_minion_job_blocklist}

and then the only thing to do for OSD would be to add this parameter in our config, for example after the line
https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L37
add

influxdb_minion_job_blocklist: .*obs_rsync.*
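The idea of the proposed setting can be sketched as follows. This is a Python illustration only, not the openQA (Perl) implementation; the function name `count_jobs_by_state` and the job dictionaries with `task` and `state` keys are assumptions made for the example.

```python
import re
from collections import Counter


def count_jobs_by_state(jobs, blocklist_pattern=None):
    """Count jobs per state, skipping tasks that match the blocklist regex.

    `jobs` is an iterable of dicts with "task" and "state" keys
    (a hypothetical shape chosen for this sketch).
    """
    blocklist = re.compile(blocklist_pattern) if blocklist_pattern else None
    counts = Counter()
    for job in jobs:
        if blocklist is not None and blocklist.search(job["task"]):
            continue  # blocklisted task: excluded from the monitoring count
        counts[job["state"]] += 1
    return counts


jobs = [
    {"task": "obs_rsync_run", "state": "failed"},
    {"task": "obs_rsync_update_builds_text", "state": "failed"},
    {"task": "some_other_task", "state": "failed"},
]
filtered = count_jobs_by_state(jobs, r".*obs_rsync.*")
```

With the pattern `.*obs_rsync.*`, both obs_rsync tasks are left out and only the remaining failed job is counted; without a pattern, all three are counted.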

#9 Updated by Xiaojing_liu 7 months ago

  • Assignee set to Xiaojing_liu

#10 Updated by mkittler 7 months ago

Adding generic filtering here seems reasonable. However, in practice it is not completely straightforward to implement. So far we're using the Minion framework to provide us with the statistics. Annoyingly, this statistics function is quite limited so we cannot easily filter here.

I see multiple ways to work around the limitation:

  1. We could use the jobs function instead. However, from the documentation it isn't clear how negative conditions would work and maybe it is not even possible. One way to work around this would be to query for the tasks we want to filter out and subtract the number of returned jobs from the statistics we've got so far. Within this ticket we only care about failed jobs so one such query for the failed jobs would be sufficient. If the filtering should work regardless of the job state we would need to invoke the jobs function 4 times (once for each state).
  2. We could also try to access the database tables Minion uses under the hood directly. Using SQL directly we would likely be able to apply the filtering more efficiently and it would likely still be a single query. In order to use the database directly, one would use $app->minion->backend->pg which will return a Mojo::Pg object.
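The subtraction described in option 1 can be sketched like this. It is a Python illustration under stated assumptions, not the merged openQA code: the `failed_jobs` key is assumed to mirror the Minion statistics, `failed_by_task` stands in for the result of querying failed jobs per task, and the numbers are made up for the example.

```python
def adjusted_failed_count(stats, failed_by_task, blocklisted_tasks):
    """Subtract failed jobs of blocklisted tasks from the overall failed count.

    stats: dict with a "failed_jobs" total (assumed key name).
    failed_by_task: dict mapping task name to its number of failed jobs,
        as one query per blocklisted task would provide.
    blocklisted_tasks: task names that monitoring should ignore.
    """
    ignored = sum(failed_by_task.get(task, 0) for task in blocklisted_tasks)
    # Guard against a negative result if the two sources are briefly out of sync.
    return max(stats["failed_jobs"] - ignored, 0)


stats = {"failed_jobs": 10}
failed_by_task = {"obs_rsync_run": 4, "obs_rsync_update_builds_text": 1}
remaining = adjusted_failed_count(
    stats, failed_by_task, ["obs_rsync_run", "obs_rsync_update_builds_text"]
)
```

Because the total and the per-task counts come from separate queries, they may not describe exactly the same moment, which is one argument for the single-query SQL approach in option 2.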

#12 Updated by cdywan 7 months ago

  • Status changed from Workable to In Progress

I suspect this should be In Progress since there's a PR proposing the fix

#13 Updated by Xiaojing_liu 7 months ago

PR in openQA has been merged.
Here is the related MR for OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/427

#14 Updated by cdywan 7 months ago

  • Status changed from In Progress to Feedback

#15 Updated by Xiaojing_liu 6 months ago

  • Status changed from Feedback to Resolved
  • Estimated time set to 16.00 h

On OSD, https://openqa.suse.de/minion/jobs?state=failed shows 10 failed jobs, including 4 'obs_rsync_run' and 1 'obs_rsync_update_builds_text'. Querying https://openqa.suse.de/admin/influxdb/minion then shows openqa_minion_jobs,url=https://openqa.suse.de active=4i,delayed=0i,failed=5i,inactive=36i openqa_minion_workers,url=https://openqa.suse.de active=1i,inactive=0i, i.e. only 5 failed jobs, so I consider this ticket resolved.
