action #176013

closed coordination #161414: [epic] Improved salt based infrastructure management

[alert] web UI: Too many Minion job failures alert size:S

Added by tinita 15 days ago. Updated 9 days ago.

Status: Resolved
Priority: Normal
Assignee: ybonatakis
Category: Regressions/Crashes
Start date: 2025-01-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

Date: Wed, 22 Jan 2025 22:44:38 +0100

https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1

Looking at https://openqa.suse.de/minion/jobs?state=failed, most failed jobs seem to be obs_rsync.

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Y
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Y/_result?package=000product
attempts: 1
children: []
created: 2025-01-22T16:47:20.274616Z
delayed: 2025-01-22T16:47:20.274616Z
expires: ~
finished: 2025-01-22T16:47:20.667998Z
id: 14216909
lax: 0
notes:
  gru_id: 39649588
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Y in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:Y
retried: ~
retries: 0
started: 2025-01-22T16:47:20.277566Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1894

However, more recent failures look like this:

---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Z
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Z/_result?package=000product
attempts: 1
children: []
created: 2025-01-23T10:06:00.665092Z
delayed: 2025-01-23T10:06:00.665092Z
expires: ~
finished: 2025-01-23T10:06:01.794449Z
id: 14225762
lax: 0
notes:
  gru_id: 39657486
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1877) [Receiver=3.2.7]
    read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Z in enviroment SUSE:ALP:Source:Standard:1.0:Staging:Z
retried: ~
retries: 0
started: 2025-01-23T10:06:00.668233Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1899
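
For reference, per-task failure counts can be queried directly on OSD; a minimal sketch using the openqa eval helper (the same helper is used in #note-8 below; the task name here is just one example):

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V \
    'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run)]})->total'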

Suggestions

  • Confirm whether this is a (temporary) network connectivity issue OR a case of deleted repos that are still being picked up by OBS sync
  • Look into ignoring the related failures OR adjusting the repo configs (see the config sketch below)
    • Or ask nicely if one of the maintainers would care to fix the config
  • Also check older failed minion jobs, and consider filing new tickets if there are separate issues there
  • Look at schedules configured in
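
For the "ignoring" option: the influxdb route already supports excluding tasks via openqa.ini. A minimal sketch of such an entry (section and key names are taken from the config dump in #note-8; the exact value syntax is an assumption):

[influxdb]
# tasks whose failed jobs are excluded from the /admin/influxdb/minion numbers
# (the list separator shown here is an assumption)
ignored_failed_minion_jobs = obs_rsync_run obs_rsync_update_builds_text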

Related issues (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)

Copied to openQA Infrastructure (public) - action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all (Resolved, tinita)

Actions #1

Updated by tinita 15 days ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #2

Updated by okurz 15 days ago

  • Parent task set to #161414
Actions #3

Updated by gpuliti 15 days ago

  • Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to livdywan
  • Parent task deleted (#161414)
Actions #4

Updated by gpuliti 15 days ago

  • Parent task set to #161414
Actions #5

Updated by tinita 14 days ago

What I find a bit weird is the number of failed minion jobs in Grafana versus in the dashboard.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&from=2025-01-22T18:45:14.947Z&to=2025-01-23T02:53:20.613Z&var-host_disks=$__all
There is a peak of 645 failed minion jobs for 4 hours, with 13 before and 14 after.
The current dashboard https://openqa.suse.de/minion says 652 failed jobs.
However, the influx route says 14:
https://openqa.suse.de/admin/influxdb/minion

openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i
openqa_minion_jobs_hook_rc_failed,...
...

So that's something to check: why does the route only report 14?
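
The raw line protocol can be re-fetched with plain curl for cross-checking; a minimal sketch against the route above (output abbreviated to the line quoted earlier):

$ curl -s https://openqa.suse.de/admin/influxdb/minion | grep '^openqa_minion_jobs,'
openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i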

Actions #6

Updated by okurz 14 days ago

  • Copied to action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all added
Actions #7

Updated by okurz 14 days ago

  • Tags set to infra, reactive work, osd, alert

I reported a separate ticket about the differing numbers, #176124, so in the current ticket we can focus on the infra aspect.

Actions #8

Updated by kraih 14 days ago

Did a quick check on OSD to see whether the Minion API works as expected, and everything looks fine. The numbers between the Minion dashboard and the influxdb endpoint differ because of the ignored_failed_minion_jobs config.

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)]})->total'
652

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->config->{influxdb}->{ignored_failed_minion_jobs}'
[
  "obs_rsync_run",
  "obs_rsync_update_builds_text"
]

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run obs_rsync_update_builds_text)]})->total'
638

It's 14 because 638 failed jobs from those two tasks are ignored here (652 - 638 = 14). Perhaps it would make sense to expose a failed_but_ignored value on the influxdb endpoint to avoid this confusion in the future. :)
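
That arithmetic can be verified in a single call; a sketch combining the three eval commands above (it should print exactly the failed count the influxdb route reports):

$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V \
    'my $ignored = app->config->{influxdb}->{ignored_failed_minion_jobs};
     app->minion->jobs({states => [qw(failed)]})->total
       - app->minion->jobs({states => [qw(failed)], tasks => $ignored})->total'
14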

Actions #9

Updated by livdywan 14 days ago

For reference, there is a relevant thread in Slack.

Actions #10

Updated by livdywan 14 days ago

  • Description updated (diff)

For reference, I found an affected test by searching for Marble (SUSE:ALP:Products:Marble:6.0 is too specific) and looking up tests/xfstests/install.pm on GitHub, which allowed me to confirm that xfstests runs on SL and, more specifically, via openqa-trigger-from-ibs-plugin.
It might be nice if this were easier to track down, e.g. with a straight-up link from the minion task. I knew what I was looking for but couldn't remember the name of the repo, and I still don't know exactly what the relevant changes are here. And GitLab won't easily reveal deleted files in its search.

Actions #11

Updated by livdywan 14 days ago

  • Status changed from Workable to In Progress

Currently going through relevant jobs and cleaning up very old ones.

Actions #12

Updated by livdywan 14 days ago · Edited

  • Status changed from In Progress to Feedback

Looks like there are build errors with Marble on IBS. Asking in Slack what the situation is here.

Actions #13

Updated by okurz 11 days ago

  • Assignee changed from livdywan to ybonatakis
Actions #14

Updated by jlausuch 11 days ago

About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared the project was faulty and OBS sync failed... and then another build fixed the issue...

I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559

About SUSE:ALP:Source:Standard:1.0:Staging:Y: I have re-run read_files.sh directly on OSD without any issue... not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.
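
Whether the directory is present at any given moment can be checked without running a sync; a minimal sketch using a plain HTTP HEAD request against the mirror URL above (it returned 200 at the time of writing, per the comment above):

$ curl -sI "https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/" | head -n 1
HTTP/1.1 200 OK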

Actions #15

Updated by ybonatakis 11 days ago

jlausuch wrote in #note-14:

About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared the project was faulty and OBS sync failed... and then another build fixed the issue...

I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559

About SUSE:ALP:Source:Standard:1.0:Staging:Y: I have re-run read_files.sh directly on OSD without any issue... not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.

For completeness, I will quote jlausuch's proposal from Slack on the above comment:

"would be nice that the minion prints the output of running read_files.sh, maybe with bash -x read_files?"
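
A minimal sketch of that proposal (the script path and argument are hypothetical placeholders, since the ticket does not show how obs_rsync_run invokes the script):

# /path/to/read_files.sh and the project argument are hypothetical placeholders
$ bash -x /path/to/read_files.sh SUSE:ALP:Source:Standard:1.0:Staging:Y 2>&1 | tee /tmp/read_files.log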

Actions #16

Updated by ybonatakis 11 days ago

As there are not as many failures in the last few days, should we keep this at high priority?

Actions #17

Updated by tinita 11 days ago

ybonatakis wrote in #note-16:

As there are not as many failures in the last few days, should we keep this at high priority?

We won't get alerted about it anyway, see #176124.
The ini file corruption caused the alert: the value we usually ignore wasn't ignored anymore.
So I would say it can be set to Normal.
Theoretically we would never have been alerted, so we wouldn't even know about this.

Actions #18

Updated by tinita 10 days ago

  • Priority changed from High to Normal
Actions #19

Updated by ybonatakis 10 days ago

Looking at the suggestions, I don't know if there is anything else for this ticket. The jobs I have seen failing from obs_rsync tasks are mostly some staging builds (which work after a while, see previous messages and Slack) and can be ignored, i.e. no new tickets.

I just noticed that the ticket is already in Feedback status. I am leaving it as is for @livdywan.

Actions #20

Updated by okurz 9 days ago

  • Status changed from Feedback to Resolved

Fine with that. We cross-checked and can resolve.
