action #176013
closed
coordination #161414: [epic] Improved salt based infrastructure management
[alert] web UI: Too many Minion job failures alert size:S
Added by tinita 20 days ago.
Updated 14 days ago.
Category:
Regressions/Crashes
Description
Observation
Date: Wed, 22 Jan 2025 22:44:38 +0100
https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1
Looking at https://openqa.suse.de/minion/jobs?state=failed most failed jobs seem to be obs_rsync.
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Y
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Y/_result?package=000product
attempts: 1
children: []
created: 2025-01-22T16:47:20.274616Z
delayed: 2025-01-22T16:47:20.274616Z
expires: ~
finished: 2025-01-22T16:47:20.667998Z
id: 14216909
lax: 0
notes:
  gru_id: 39649588
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Y in enviroment
    SUSE:ALP:Source:Standard:1.0:Staging:Y
retried: ~
retries: 0
started: 2025-01-22T16:47:20.277566Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1894
However, more recent failures look like this:
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:Z
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:Z/_result?package=000product
attempts: 1
children: []
created: 2025-01-23T10:06:00.665092Z
delayed: 2025-01-23T10:06:00.665092Z
expires: ~
finished: 2025-01-23T10:06:01.794449Z
id: 14225762
lax: 0
notes:
  gru_id: 39657486
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1877) [Receiver=3.2.7]
    read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:Z in enviroment SUSE:ALP:Source:Standard:1.0:Staging:Z
retried: ~
retries: 0
started: 2025-01-23T10:06:00.668233Z
state: failed
task: obs_rsync_run
time: 2025-01-23T10:07:07.353823Z
worker: 1899
Suggestions
- Confirm whether this is a (temporary) network connectivity issue OR a case of deleted repos that are still getting picked up by OBS sync
- Look into ignoring related failures OR adjusting repo configs
- Or ask nicely if one of the maintainers would care to fix the config
- Also check older failed minion jobs, consider filing new tickets if there are separate issues there
- Look at schedules configured in
- Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
- Parent task set to #161414
- Subject changed from [alert] web UI: Too many Minion job failures alert to [alert] web UI: Too many Minion job failures alert size:S
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to livdywan
- Parent task deleted (#161414)
- Parent task set to #161414
- Copied to action #176124: OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all added
- Tags set to infra, reactive work, osd, alert
I reported a separate ticket about the differing numbers (#176124), so in the current ticket we can focus on the infra aspect.
Did a quick check on OSD to see that the Minion API works as expected, and everything looks fine. The numbers from the Minion dashboard and the influxdb endpoint differ because of the ignored_failed_minion_jobs config.
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)]})->total'
652
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->config->{influxdb}->{ignored_failed_minion_jobs}'
[
"obs_rsync_run",
"obs_rsync_update_builds_text"
]
$ /usr/bin/perl /usr/share/openqa/script/openqa eval -V 'app->minion->jobs({states => [qw(failed)], tasks => [qw(obs_rsync_run obs_rsync_update_builds_text)]})->total'
638
It's 14 because the 638 failed jobs from those two tasks are ignored here (652 - 638 = 14). Perhaps it would make sense to expose a failed_but_ignored value on the influxdb endpoint to avoid this confusion in the future. :)
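For reference, this filtering presumably comes from the [influxdb] section of openqa.ini on OSD and is what the influxdb minion route reflects. A minimal sketch of how to cross-check both; the exact config formatting and the route path are assumptions, not verified here:
$ grep -A2 '^\[influxdb\]' /etc/openqa/openqa.ini   # assumed to show: ignored_failed_minion_jobs = obs_rsync_run obs_rsync_update_builds_text
$ curl -s https://openqa.suse.de/admin/influxdb/minion | grep minion_jobs   # assumed route path; compare the failed count with the Minion dashboard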
- Description updated (diff)
For reference, I found an affected test by searching for Marble (SUSE:ALP:Products:Marble:6.0 is too specific) and looking up tests/xfstests/install.pm on GitHub, which allowed me to confirm that xfstests run on SL and, more specifically, via openqa-trigger-from-ibs-plugin.
Might be nice if this were easier to track down, like a straight-up link from the minion task. I knew what I was looking for, but couldn't remember the name of the repo and still don't know exactly what the relevant changes are here. And GitLab won't easily reveal deleted files in the search.
- Status changed from Workable to In Progress
Currently going through relevant jobs and cleaning up very old ones.
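A minimal sketch of how such a cleanup can be done from the command line, assuming the Minion job command is reachable through the same openQA script wrapper used above (the job id is just the first example from this ticket):
$ /usr/share/openqa/script/openqa minion job -S failed -t obs_rsync_run   # list failed obs_rsync_run jobs
$ /usr/share/openqa/script/openqa minion job --remove 14216909            # remove one old failed job by id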
- Status changed from In Progress to Feedback
- Assignee changed from livdywan to ybonatakis
About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
To me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared, the project was faulty and OBS sync failed... and then another build fixed the issue.
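A quick way to re-check whether the repodata is currently published, assuming the standard repodata layout (the repomd.xml file name is an assumption):
$ curl -sI https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/repomd.xml | head -n1   # 200 means the directory exists right now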
I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559
About
SUSE:ALP:Source:Standard:1.0:Staging:Y
I have re-run read_files.sh directly on OSD without any issue... Not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.
jlausuch wrote in #note-14:
About
rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
To me it seems like a very sporadic thing. The directory is created for every build the Maintenance Coordinators push to the project.
Currently, it exists: https://download.suse.de/ibs/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/Z/images/repo/SL-Micro-6.0-x86_64/repodata/
But maybe when the error appeared, the project was faulty and OBS sync failed... and then another build fixed the issue.
I actually found a similar issue of a missing file and reported it:
https://suse.slack.com/archives/C067JPZF4JH/p1737558056228559
About
SUSE:ALP:Source:Standard:1.0:Staging:Y
I have re-run read_files.sh directly on OSD without any issue... Not sure what happened, but it seems very sporadic. It's difficult to know the reason without more logs.
I will quote jlausuch's proposal from Slack for completeness on the above comment:
"would be nice that the minion prints the output of running read_files.sh, maybe with bash -x read_files?"
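A minimal sketch of what that would look like when reproducing manually on OSD; the script location and argument below are placeholders, since they are not spelled out in this ticket:
$ bash -x /path/to/read_files.sh SUSE:ALP:Source:Standard:1.0:Staging:Z 2>&1 | tee /tmp/read_files_trace.log   # keep the trace for later inspection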
As there are not as many failures in the last few days, should we keep this at high priority?
ybonatakis wrote in #note-16:
As there are not as many failures in the last few days, should we keep this at high priority?
We won't get alerted about it anyway, see #176124.
The ini file corruption caused this: the value we usually ignore wasn't ignored anymore.
So I would say it can be set to Normal.
Theoretically we would never have been alerted anyway, so we wouldn't even know about this.
- Priority changed from High to Normal
Looking at the suggestions, I don't know if there is anything else left for this ticket. The jobs I have seen failing from obs_rsync tasks are mostly some staging builds (which work after a while, see previous messages and Slack), so they can be ignored and no new tickets are needed.
I just noticed that the ticket is already marked with Feedback status. I am leaving it as-is for @livdywan.
- Status changed from Feedback to Resolved
Fine with that. We cross-checked and can resolve.