action #70768

closed

obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently

Added by mkittler over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2020-09-01
Due date:
% Done: 0%
Estimated time: 16.00 h

Description

Observation

The obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently on OSD (but not on o3).

The failing obs_rsync_run jobs can be observed using the following query parameters: https://openqa.suse.de/minion/jobs?soffset=0&task=obs_rsync_run&state=failed

I'm going to remove most of these jobs to calm down the alert, but right now 43 jobs have piled up over 22 days. However, the problem has actually existed for longer than 22 days. I assume somebody cleaned up the Minion dashboard at some point.

The job arguments are always like this:

  "args" => [
    {
      "project" => "SUSE:SLE-15-SP3:GA:Staging:E"
    }
  ],

The results always look like one of these:

  "result" => {
    "code" => 256,
    "message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/S/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]"
  },
  "result" => {
    "code" => 256,
    "message" => "No file found: {SUSE:SLE-15-SP3:GA:Staging:B/read_files.sh}"
  },

So there are two different cases in which something cannot be found.
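
The two cases can be told apart mechanically from the result message. The following is only a minimal sketch in Python, not part of openQA; the function name classify_failure and the category labels are made up for illustration, and the patterns assume the messages keep the shapes shown above:

```python
import re

# Patterns derived from the two example messages above.
# The labels are hypothetical names chosen for this sketch.
PATTERNS = [
    ("missing_rsync_dir",
     re.compile(r'rsync: change_dir "[^"]*" \(in repos\) failed: '
                r'No such file or directory')),
    ("missing_file", re.compile(r"No file found: \{[^}]*\}")),
]


def classify_failure(message):
    """Return a coarse category for a Minion job result message."""
    for label, pattern in PATTERNS:
        if pattern.search(message):
            return label
    return "other"
```

Anything that matches neither pattern (for example an empty message) ends up in "other" and would still need a manual look.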

There are also some failing obs_rsync_update_builds_text jobs which look like this:

  "result" => {
    "code" => 256,
    "message" => "rsync: change_dir \"/SUSE:/SLE-15-SP3:/GA:/Staging:/E/images/iso\" (in repos) failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1674) [Receiver=3.1.3]\n"
  },

Suggestions

It looks like these failures are not a practical problem; at least we haven't received any negative feedback. Likely it works again on the next run, or the job was not needed anymore anyway. If that's true, these jobs should not end up as failures¹ or shouldn't have been created in the first place. Maybe there's also an actual bug we need to fix.

¹ Failures in the sense of the Minion dashboard are jobs which should be manually investigated, but I doubt these failing jobs should be manually investigated when they occur.

  • Check low-level commands executed by obsrsync, potentially try to reproduce manually
  • Check if source project folders exist or not
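
For the folder check, it helps to know which remote path a project maps to. Judging only from the error messages above, each ":" in the project name becomes ":/" and "images/iso" is appended. A minimal Python sketch of that mapping follows; the function name repos_path is hypothetical, and the mapping is inferred from the messages rather than taken from the obs_rsync code:

```python
def repos_path(project, subdir="images/iso"):
    """Build the path inside the "repos" rsync module for an OBS project.

    Inferred from the error messages in this ticket: every ":" in the
    project name is followed by a "/" in the remote path.
    """
    return "/" + project.replace(":", ":/") + "/" + subdir


if __name__ == "__main__":
    print(repos_path("SUSE:SLE-15-SP3:GA:Staging:E"))
```

The resulting path could then be listed manually with rsync against the actual host, which is not named in this ticket.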

Just adjust our monitoring to ignore obs_rsync failures. If test reviewers find missing assets in their tests, these tests will end up incomplete, and https://openqa.suse.de/minion/jobs?state=failed provides additional information on demand.


Related issues 3 (2 open, 1 closed)

Related to openQA Infrastructure - action #70975: [alert] too many failed minion jobs (Resolved, okurz, 2020-09-04)
Related to openQA Project - coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert (New, 2020-09-01)
Copied to QA - action #112871: obs_rsync_run Minion tasks fail with no error message size:M (Workable, livdywan)

Actions #1

Updated by okurz over 3 years ago

  • Tags set to obs_rsync, alert, minion
  • Project changed from openQA Project to openQA Infrastructure
  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

mkittler wrote: "Maybe there's also an actual bug we need to fix." So far, however, it looks like the problem is that the actual project directories do not exist on IBS, so I think this fits better in "openQA Infrastructure" until we know whether there is an actual issue to fix in our code.

Actions #2

Updated by mkittler over 3 years ago

  • Subject changed from obs_rsync_run Minion tasks fail frequently to obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently
  • Description updated (diff)
Actions #3

Updated by okurz over 3 years ago

deleted all but the oldest failed minion jobs from https://openqa.suse.de/minion/jobs?state=failed

Actions #4

Updated by okurz over 3 years ago

  • Related to action #70975: [alert] too many failed minion jobs added
Actions #5

Updated by okurz over 3 years ago

  • Priority changed from Normal to High

I have to repeatedly delete the corresponding Minion jobs. It happened again as reported in #77089, so I am increasing the priority. At the very least, if we do not have a better idea, we need to adjust monitoring so that it does not alert us about this.

Actions #6

Updated by Xiaojing_liu over 3 years ago

Some obs_rsync_run Minion jobs have failed. One of them shows the following result:
https://openqa.suse.de/minion/jobs?id=933724

  "result" => {
    "code" => 512,
    "message" => "SUSE:SLE-15-SP3:GA:TEST/base/ exit code: 1 (1 failures total so far)\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (2 failures total so far)"
  },

And https://openqa.suse.de/minion/jobs?id=933542 shows:

  "result" => {
    "code" => 256,
    "message" => "No changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/base/}\nSUSE:SLE-15-SP3:GA:TEST/jeos/ exit code: 1 (1 failures total so far)\nNo changes found since last run, skipping {SUSE:SLE-15-SP3:GA:TEST/migration/}"
  },

obs_rsync_update_builds_text failed like this:
https://openqa.suse.de/minion/jobs?id=932921

  "result" => {
    "code" => 256,
    "message" => ""
  },
Actions #7

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #8

Updated by okurz over 3 years ago

Grafana does not know which jobs are failing, only how many. The data comes from the openQA route /admin/influxdb/minion, e.g.
https://openqa.suse.de/admin/influxdb/minion

with example output:

openqa_minion_jobs,url=https://openqa.suse.de active=0i,delayed=0i,failed=5i,inactive=0i
openqa_minion_workers,url=https://openqa.suse.de active=0i,inactive=1i
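
For reference, a record of this output can be split into measurement, tags and fields with a few lines of Python. This is only a sketch with a hypothetical helper name (parse_influx_line); it assumes one record per call, integer fields with an "i" suffix, no timestamp and no escaped spaces or commas, which holds for the example above but not for InfluxDB line protocol in general:

```python
def parse_influx_line(line):
    """Split one simple line-protocol record into its three parts."""
    head, field_part = line.strip().split(" ", 1)
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, value = item.split("=", 1)
        fields[key] = int(value.rstrip("i"))  # "5i" -> 5
    return measurement, tags, fields


if __name__ == "__main__":
    record = ("openqa_minion_jobs,url=https://openqa.suse.de "
              "active=0i,delayed=0i,failed=5i,inactive=0i")
    print(parse_influx_line(record))
```

The "failed" field is the only number the alert sees, which is why any per-task filtering has to happen before this output is generated.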

The best thing I can think of right now is to change the openQA code so that it can filter which Minion jobs are included in the count, for example based on a config variable with a regex string.

In https://github.com/os-autoinst/openQA/blob/e8470f0ab1b197c8dac5914645350d1412b62a2d/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L107

one could filter out jobs based on this regex string, read for example from

$self->app->config->{global}->{influxdb_minion_job_blocklist}

and then the only thing to do for OSD would be to add this parameter in our config, for example after the line
https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L37
add

influxdb_minion_job_blocklist: .*obs_rsync.*
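
To illustrate the intended effect of such a blocklist, here is a minimal Python sketch (openQA itself is written in Perl, so this is not the proposed patch). The function count_jobs and the task name other_task are hypothetical; whether the real implementation matches the regex anywhere in the task name or against the whole name is an assumption here:

```python
import re
from collections import Counter


def count_jobs(jobs, blocklist_regex=None):
    """Count jobs per state, skipping tasks that match blocklist_regex."""
    blocked = re.compile(blocklist_regex) if blocklist_regex else None
    counts = Counter()
    for job in jobs:
        if blocked and blocked.search(job["task"]):
            continue  # excluded from the statistics
        counts[job["state"]] += 1
    return counts


if __name__ == "__main__":
    # Made-up job list for illustration only.
    jobs = (
        [{"task": "obs_rsync_run", "state": "failed"}] * 4
        + [{"task": "obs_rsync_update_builds_text", "state": "failed"}]
        + [{"task": "other_task", "state": "failed"}] * 5
    )
    print(count_jobs(jobs)["failed"])
    print(count_jobs(jobs, r".*obs_rsync.*")["failed"])
```

With the blocklist in place only the jobs of other tasks contribute to the failed count that Grafana alerts on.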
Actions #9

Updated by Xiaojing_liu over 3 years ago

  • Assignee set to Xiaojing_liu
Actions #10

Updated by mkittler over 3 years ago

Adding generic filtering here seems reasonable. However, in practice it is not completely straightforward to implement. So far we're using the Minion framework to provide us with the statistics. Annoyingly, this statistics function is quite limited, so we cannot easily filter here.

I see multiple ways to work around the limitation:

  1. We could use the jobs function instead. However, from the documentation it isn't clear how negative conditions would work, and maybe it is not even possible. One way to work around this would be to query for the tasks we want to filter out and subtract the number of returned jobs from the statistics we've got so far. Within this ticket we only care about failed jobs, so one such query for the failed jobs would be sufficient. If the filtering should work regardless of the job state, we would need to invoke the jobs function 4 times (once for each state).
  2. We could also try to directly access the database tables Minion uses under the hood. Using SQL directly, we would likely be able to apply the filtering more efficiently, and it would likely still be a single query. In order to use the database directly, one would use $app->minion->backend->pg, which returns a Mojo::Pg object.
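
Both options can be compared in a small Python sketch. It uses an in-memory SQLite table as a stand-in for the PostgreSQL database, and it assumes a table shaped like Minion's minion_jobs with task and state columns; the helper names and the task other_task are hypothetical. Option 1 is emulated with two count queries instead of Minion's stats and jobs functions:

```python
import sqlite3

BLOCKED_TASKS = ("obs_rsync_run", "obs_rsync_update_builds_text")
PLACEHOLDERS = ",".join("?" * len(BLOCKED_TASKS))


def make_demo_db():
    """Create a stand-in table with made-up jobs (not real OSD data)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE minion_jobs "
               "(id INTEGER PRIMARY KEY, task TEXT, state TEXT)")
    rows = ([("obs_rsync_run", "failed")] * 4
            + [("obs_rsync_update_builds_text", "failed")]
            + [("other_task", "failed")] * 5
            + [("other_task", "inactive")] * 2)
    db.executemany("INSERT INTO minion_jobs (task, state) VALUES (?, ?)", rows)
    return db


def failed_by_subtraction(db):
    """Option 1: total failed count minus the blocklisted failed jobs."""
    total = db.execute(
        "SELECT count(*) FROM minion_jobs WHERE state = 'failed'"
    ).fetchone()[0]
    blocked = db.execute(
        "SELECT count(*) FROM minion_jobs "
        f"WHERE state = 'failed' AND task IN ({PLACEHOLDERS})",
        BLOCKED_TASKS,
    ).fetchone()[0]
    return total - blocked


def counts_by_single_query(db):
    """Option 2: one query that excludes blocklisted tasks for all states."""
    rows = db.execute(
        "SELECT state, count(*) FROM minion_jobs "
        f"WHERE task NOT IN ({PLACEHOLDERS}) GROUP BY state",
        BLOCKED_TASKS,
    )
    return dict(rows)
```

The sketch filters by exact task names because SQLite has no built-in regex operator; on PostgreSQL a regex blocklist could instead be applied with the `!~` operator.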
Actions #12

Updated by livdywan over 3 years ago

  • Status changed from Workable to In Progress

I suspect this should be In Progress since there's a PR proposing the fix

Actions #13

Updated by Xiaojing_liu over 3 years ago

PR in openQA has been merged.
Here is the related MR for OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/427

Actions #14

Updated by livdywan over 3 years ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by Xiaojing_liu over 3 years ago

  • Status changed from Feedback to Resolved
  • Estimated time set to 16.00 h

On OSD, https://openqa.suse.de/minion/jobs?state=failed shows 10 failed jobs, including 4 'obs_rsync_run' and 1 'obs_rsync_update_builds_text'. Querying https://openqa.suse.de/admin/influxdb/minion then shows openqa_minion_jobs,url=https://openqa.suse.de active=4i,delayed=0i,failed=5i,inactive=36i openqa_minion_workers,url=https://openqa.suse.de active=1i,inactive=0i, i.e. only 5 failed jobs (10 minus the 5 obs_rsync jobs), so I consider this ticket resolved.

Actions #16

Updated by mkittler over 2 years ago

  • Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Actions #17

Updated by livdywan almost 2 years ago

  • Copied to action #112871: obs_rsync_run Minion tasks fail with no error message size:M added