action #176013
closed · coordination #161414: [epic] Improved salt based infrastructure management
[alert] web UI: Too many Minion job failures alert size:S
Description
Observation
Date: Wed, 22 Jan 2025 22:44:38 +0100
https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1
Looking at https://openqa.suse.de/minion/jobs?state=failed, most failed jobs seem to be obs_rsync tasks.
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Y
url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Y/_result?package=000product
attempts: 1
children: []
created: 2025-01-22T16:47:20.274616Z
delayed: 2025-01-22T16:47:20.274616Z
expires: ~
finished: 2025-01-22T16:47:20.667998Z
id: 14216909
lax: 0
notes:
gru_id: 39649588
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 256
message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Y in enviroment
SUSE:ALP:Source:Standard:1.0:Staging:Y
retried: ~
retries: 0
started: 2025-01-22T16:47:20.277566Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1894
However, more recent failures look like this:
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Z
url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Z/_result?package=000product
attempts: 1
children: []
created: 2025-01-23T10:06:00.665092Z
delayed: 2025-01-23T10:06:00.665092Z
expires: ~
finished: 2025-01-23T10:06:01.794449Z
id: 14225762
lax: 0
notes:
gru_id: 39657486
project_lock: 1
parents: []
priority: 100
queue: default
result:
code: 256
message: |-
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1877) [Receiver=3.2.7]
read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Z in enviroment SUSE:ALP:Source:Standard:1.0:Staging:Z
retried: ~
retries: 0
started: 2025-01-23T10:06:00.668233Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1899
Suggestions
- Confirm if this is a (temporary) network connectivity issue OR a case of repos deleted which are still getting picked up by OBS sync
- Look into ignoring related failures OR adjusting repo configs
- Or ask nicely if one of the maintainers would care to fix the config
- Also check older failed minion jobs, consider filing new tickets if there are separate issues there
- Look at schedules configured in
Updated by tinita 15 days ago
- Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by tinita 14 days ago
What I find a bit weird is the number of failed minion jobs in grafana versus in the dashboard.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&from=2025-01-22T18:45:14.947Z&to=2025-01-23T02:53:20.613Z&var-host_disks=$__all
There is this peak of 645 failed minion jobs for 4 hours, while having 13 before and 14 after.
The current dashboard https://openqa.suse.de/minion says 652 failed jobs.
However the influx route says 14:
https://openqa.suse.de/admin/influxdb/minion
openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i
openqa_minion_jobs_hook_rc_failed,...
...
So that's something to check: why does the route only report 14?
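As a hedged illustration (a hypothetical helper, not part of openQA): the influx route's output above is InfluxDB line protocol, and its integer fields can be parsed to pull out the failed count for comparison with the dashboard.

```python
# Hypothetical helper: parse one line of InfluxDB line protocol, as returned
# by /admin/influxdb/minion, into a dict of integer fields. Assumes the line
# has no spaces inside measurement/tag/field values (true for this output).
def parse_influx_line(line: str) -> dict:
    """Split '<measurement>,<tags> <fields>' and decode the integer fields."""
    _, fields = line.rsplit(" ", 1)
    result = {}
    for pair in fields.split(","):
        key, value = pair.split("=")
        result[key] = int(value.rstrip("i"))  # trailing 'i' marks integers
    return result

line = "openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i"
print(parse_influx_line(line)["failed"])  # 14
```

This makes it easy to script a comparison of the route's numbers against the Minion dashboard over time.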
Updated by okurz 14 days ago
- Copied to action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all added
Updated by kraih 14 days ago
Did a quick check on OSD to see that the Minion API works as expected, and everything looks fine. The numbers between the Minion dashboard and the influxdb endpoint are different because of the ignored_failed_minion_jobs config.
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)]})->total'
652
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->config->{influxdb}->{ignored_failed_minion_jobs}'
[
"obs_rsync_run",
"obs_rsync_update_builds_text"
]
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run obs_rsync_update_builds_text)]})->total'
638
It's 14 because 638 failed jobs from those two tasks are ignored here. Perhaps it would make sense to expose a failed_but_ignored
value on the influxdb endpoint to avoid this confusion in the future. :)
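To illustrate kraih's numbers, here is a minimal sketch of the subtraction the route effectively performs. The real implementation lives in openQA's Perl code; this helper and its job list are purely illustrative.

```python
# Illustrative only: mimic how the influxdb route would report 14 instead of
# 652 when failed jobs from ignored tasks are filtered out, plus the proposed
# failed_but_ignored value that would make the difference visible.
def influx_counts(failed_job_tasks, ignored_tasks):
    """failed_job_tasks: task name of each failed Minion job."""
    ignored = sum(1 for task in failed_job_tasks if task in ignored_tasks)
    return {
        "failed": len(failed_job_tasks) - ignored,
        "failed_but_ignored": ignored,  # the proposed extra field
    }

# Reconstructing the ticket's numbers: 652 failed total, 638 from ignored tasks.
jobs = (["obs_rsync_run"] * 600
        + ["obs_rsync_update_builds_text"] * 38
        + ["some_other_task"] * 14)
print(influx_counts(jobs, {"obs_rsync_run", "obs_rsync_update_builds_text"}))
# {'failed': 14, 'failed_but_ignored': 638}
```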
Updated by livdywan 14 days ago
- Description updated (diff)
For reference, I found an affected test by searching for Marble (SUSE:ALP:Products:Marble:6.0
is too specific) and looking up tests/xfstests/install.pm on GitHub, which allowed me to confirm that xfstests run on SL and, more specifically, via openqa-trigger-from-ibs-plugin.
Might be nice if this were easier to track down, e.g. a straight-up link from the minion task. I knew what I was looking for but couldn't remember the name of the repo, and I still don't know exactly what the relevant changes are here. And GitLab won't easily reveal deleted files in the search.
Updated by jlausuch 11 days ago
About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared, the project was faulty and OBS-sync failed... and then another build fixed the issue...
I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559
About
SUSE:ALP:Source:Standard:1.0:Staging:Y
I have re-run the read_files.sh directly in OSD without any issue... not sure what happened, but it seems very sporadic. But difficult to know the reason without more logs.
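One more observation on the error dumps above: the Minion result shows code: 256. Assuming that is a raw wait()-style status (as in Perl's $?), the script's actual exit code is the high byte, i.e. 1. rsync itself documents exit code 23 as a partial transfer due to error, which matches the missing repodata directory. A small sketch (the status interpretation is an assumption, not confirmed from openQA's code):

```python
# Sketch: decode a wait()-style status as stored in Perl's $?, and map the
# rsync exit codes seen in this ticket to their documented meanings.
RSYNC_EXIT = {
    0: "success",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
}

def decode_wait_status(status: int) -> int:
    """The high byte of a wait()-style status is the child's exit code."""
    return status >> 8

print(decode_wait_status(256))  # 1 (read_files.sh exited non-zero)
print(RSYNC_EXIT[23])           # partial transfer due to error
```

If exit 23 for this project is considered transient (repodata temporarily missing between builds), a retry or an explicit ignore for that code would be one way to reduce the noise.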
Updated by ybonatakis 11 days ago
jlausuch wrote in #note-14:
About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
to me it seems like a very sporadic thing. The directory is created for every build Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared, the project was faulty and OBS-sync failed... and then another build fixed the issue...
I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559
About
SUSE:ALP:Source:Standard:1.0:Staging:Y
I have re-run the read_files.sh directly in OSD without any issue... not sure what happened, but it seems very sporadic. But difficult to know the reason without more logs.
I will quote jlausuch's proposal from Slack for completeness on the above comment:
"would be nice that the minion prints the output of running read_files.sh, maybe with bash -x read_files?"
Updated by ybonatakis 11 days ago
As there are not as many failures in the last few days, should we keep this at high priority?
Updated by tinita 11 days ago
ybonatakis wrote in #note-16:
As there are not as many failures in the last few days, should we keep this at high priority?
We won't get alerted about it anyway, see #176124.
The ini file corruption caused the alert: the value we usually ignore wasn't ignored anymore.
So I would say it can be set to normal priority.
Theoretically we would never have been alerted, so we wouldn't even have known about this.
Updated by ybonatakis 10 days ago
Looking at the suggestions, I don't know if there is anything else left for this ticket. The jobs I have seen failing from obs_sync tasks are mostly some staging builds (which work again after a while, see previous messages and Slack) and can be ignored, so no new tickets.
I just noticed that the ticket is already marked with Feedback status. I am leaving it as is for @livdywan.