action #176013

closed coordination #161414: [epic] Improved salt based infrastructure management

[alert] web UI: Too many Minion job failures alert size:S

Added by tinita 15 days ago. Updated 9 days ago.

Status: Resolved
Priority: Normal
Assignee: ybonatakis
Category: Regressions/Crashes
Start date: 2025-01-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Date: Wed, 22 Jan 2025 22:44:38 +0100

https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1

Looking at https://openqa.suse.de/minion/jobs?state=failed, most failed jobs seem to be obs_rsync.

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Y
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Y/_result?package=000product
attempts: 1
children: []
created: 2025-01-22T16:47:20.274616Z
delayed: 2025-01-22T16:47:20.274616Z
expires: ~
finished: 2025-01-22T16:47:20.667998Z
id: 14216909
lax: 0
notes:
  gru_id: 39649588
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Y in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:Y
retried: ~
retries: 0
started: 2025-01-22T16:47:20.277566Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1894

However, more recent failures look like this:

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Z
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Z/_result?package=000product
attempts: 1
children: []
created: 2025-01-23T10:06:00.665092Z
delayed: 2025-01-23T10:06:00.665092Z
expires: ~
finished: 2025-01-23T10:06:01.794449Z
id: 14225762
lax: 0
notes:
  gru_id: 39657486
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1877) [Receiver=3.2.7]
    read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Z in enviroment SUSE:ALP:Source:Standard:1.0:Staging:Z
retried: ~
retries: 0
started: 2025-01-23T10:06:00.668233Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1899
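
For reference, per-task failure counts can be queried directly on OSD; a minimal sketch using the openqa eval helper (the same helper is used in #note-8 below; the task name here is just one example):

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V \
    'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run)]})->total'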

Suggestions

  • Confirm whether this is a (temporary) network connectivity issue OR a case of deleted repos that are still being picked up by OBS sync
  • Look into ignoring the related failures OR adjusting the repo configs (see the config sketch below)
    • Or ask nicely if one of the maintainers would care to fix the config
  • Also check older failed minion jobs, and consider filing new tickets if there are separate issues there
  • Look at schedules configured in
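
For the "ignoring" option: the influxdb route already supports excluding tasks via openqa.ini. A minimal sketch of such an entry (section and key names are taken from the config dump in #note-8; the exact value syntax is an assumption):

[influxdb]
# tasks whose failed jobs are excluded from the /admin/influxdb/minion numbers
# (the list separator shown here is an assumption)
ignored_failed_minion_jobs = obs_rsync_run obs_rsync_update_builds_text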

Related issues (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)

Copied to openQA Infrastructure (public) - action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all (Resolved, tinita)

Actions #1

Updated by tinita 15 days ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #2

Updated by okurz 15 days ago

  • Parent task set to #161414
Actions #3

Updated by gpuliti 15 days ago

  • Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to livdywan
  • Parent task deleted (#161414)
Actions #4

Updated by gpuliti 15 days ago

  • Parent task set to #161414
Actions #5

Updated by tinita 14 days ago

What I find a bit weird is the number of failed minion jobs in Grafana versus in the dashboard.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&from=2025-01-22T18:45:14.947Z&to=2025-01-23T02:53:20.613Z&var-host_disks=$__all
There is a peak of 645 failed minion jobs for 4 hours, with 13 before and 14 after.
The current dashboard https://openqa.suse.de/minion says 652 failed jobs.
However, the influx route says 14:
https://openqa.suse.de/admin/influxdb/minion

openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i
openqa_minion_jobs_hook_rc_failed,...
...

So that's something to check: why does the route only report 14?
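
The raw line protocol can be re-fetched with plain curl for cross-checking; a minimal sketch against the route above (output abbreviated to the line quoted earlier):

$ curl -s https://openqa.suse.de/admin/influxdb/minion | grep '^openqa_minion_jobs,'
openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i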

Actions #6

Updated by okurz 14 days ago

  • Copied to action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all added
Actions #7

Updated by okurz 14 days ago

  • Tags set to infra, reactive work, osd, alert

I reported a separate ticket about the differing numbers, #176124, so in the current ticket we can focus on the infra aspect.

Actions #8

Updated by kraih 14 days ago

Did a quick check on OSD to see whether the Minion API works as expected, and everything looks fine. The numbers between the Minion dashboard and the influxdb endpoint differ because of the ignored_failed_minion_jobs config.

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)]})->total'
652

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->config->{influxdb}->{ignored_failed_minion_jobs}'
[
  "obs_rsync_run",
  "obs_rsync_update_builds_text"
]

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run obs_rsync_update_builds_text)]})->total'
638

It's 14 because 638 failed jobs from those two tasks are ignored here (652 - 638 = 14). Perhaps it would make sense to expose a failed_but_ignored value on the influxdb endpoint to avoid this confusion in the future. :)
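
That arithmetic can be verified in a single call; a sketch combining the three eval commands above (it should print exactly the failed count the influxdb route reports):

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V \
    'my $ignored = app->config->{influxdb}->{ignored_failed_minion_jobs};
     app->minion->jobs({states => [qw(failed)]})->total
       - app->minion->jobs({states => [qw(failed)], tasks => $ignored})->total'
14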

Actions #9

Updated by livdywan 14 days ago

For reference, there is a relevant thread in Slack.

Actions #10

Updated by livdywan 14 days ago

  • Description updated (diff)

For reference, I found an affected test by searching for Marble (SUSE:ALP:Products:Marble:6.0 is too specific) and looking up tests/xfstests/install.pm on GitHub, which allowed me to confirm that xfstests runs on SL and, more specifically, via openqa-trigger-from-ibs-plugin.
It might be nice if this were easier to track down, e.g. with a straight-up link from the minion task. I knew what I was looking for but couldn't remember the name of the repo, and I still don't know exactly what the relevant changes are here. And GitLab won't easily reveal deleted files in its search.

Actions #11

Updated by livdywan 14 days ago

  • Status changed from Workable to In Progress

Currently going through relevant jobs and cleaning up very old ones.

Actions #12

Updated by livdywan 14 days ago · Edited

  • Status changed from In Progress to Feedback

Looks like there are build errors with Marble on IBS. Asking in Slack what the situation is here.

Actions #13

Updated by okurz 11 days ago

  • Assignee changed from livdywan to ybonatakis
Actions #14

Updated by jlausuch 11 days ago

About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared the project was faulty and OBS sync failed... and then another build fixed the issue...

I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559

About SUSE:ALP:Source:Standard:1.0:Staging:Y: I have re-run read_files.sh directly on OSD without any issue... not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.
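
Whether the directory is present at any given moment can be checked without running a sync; a minimal sketch using a plain HTTP HEAD request against the mirror URL above (it returned 200 at the time of writing, per the comment above):

$ curl -sI "https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/" | head -n 1
HTTP/1.1 200 OK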

Actions #15

Updated by ybonatakis 11 days ago

jlausuch wrote in #note-14:

About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared the project was faulty and OBS sync failed... and then another build fixed the issue...

I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559

About SUSE:ALP:Source:Standard:1.0:Staging:Y: I have re-run read_files.sh directly on OSD without any issue... not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.

For completeness, I will quote jlausuch's proposal from Slack on the above comment:

"would be nice that the minion prints the output of running read_files.sh, maybe with bash -x read_files?"
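
A minimal sketch of that proposal (the script path and argument are hypothetical placeholders, since the ticket does not show how obs_rsync_run invokes the script):

# /path/to/read_files.sh and the project argument are hypothetical placeholders
$ bash -x /path/to/read_files.sh SUSE:ALP:Source:Standard:1.0:Staging:Y 2>&1 | tee /tmp/read_files.log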

Actions #16

Updated by ybonatakis 11 days ago

As there are not as many failures in the last few days, should we keep this at high priority?

Actions #17

Updated by tinita 11 days ago

ybonatakis wrote in #note-16:

As there are not as many failures in the last few days, should we keep this at high priority?

We won't get alerted about it anyway, see #176124.
The ini file corruption caused the alert: the value we usually ignore wasn't ignored anymore.
So I would say it can be set to Normal.
Theoretically we would never have been alerted, so we wouldn't even know about this.

Actions #18

Updated by tinita 10 days ago

  • Priority changed from High to Normal
Actions #19

Updated by ybonatakis 10 days ago

Looking at the suggestions, I don't know if there is anything else for this ticket. The jobs I have seen failing from obs_rsync tasks are mostly some staging builds (which work after a while, see previous messages and Slack) and can be ignored, i.e. no new tickets.

I just noticed that the ticket is already in Feedback status. I am leaving it as is for @livdywan.

Actions #20

Updated by okurz 9 days ago

  • Status changed from Feedback to Resolved

Fine with that. We cross-checked and can resolve.
