action #163340
OBSRSync regularly fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") size:M
Description
Observation
We receive emails with the subject "Munin - minion Minion Jobs" and content like this:
opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
CRITICALs: failed is 501.00 (outside range [:500]).
We also see the same on OSD, see https://progress.opensuse.org/issues/163340#note-6 for some examples.
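For context, the alert text uses Munin's `min:max` threshold notation: `[:500]` has no lower bound and an upper bound of 500, so 501 failed jobs triggers the critical state. A minimal Python sketch of that check follows. It assumes the upper bound itself still counts as inside the range and handles only the colon form; `parse_range` and `outside_range` are illustrative names, not Munin functions.

```python
def parse_range(spec):
    """Parse a 'min:max' range as printed in the alert, e.g. '[:500]'.

    Either bound may be empty, meaning "unbounded" on that side.
    Only the colon form is handled in this sketch.
    """
    spec = spec.strip().strip("[]")
    if ":" not in spec:
        raise ValueError(f"expected 'min:max', got {spec!r}")
    low, _, high = spec.partition(":")
    return (float(low) if low else None, float(high) if high else None)


def outside_range(value, spec):
    """Return True if value lies outside the range (bounds assumed inclusive)."""
    low, high = parse_range(spec)
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False


print(outside_range(501.0, "[:500]"))  # True: matches the CRITICAL in the email
print(outside_range(500.0, "[:500]"))  # False under the inclusive-bound assumption
```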
Looking at https://openqa.opensuse.org/minion/jobs?state=failed, a lot of obs_rsync_run jobs fail:
---
args:
- project: openSUSE:Slowroll:Build:2
  url: https://api.opensuse.org/public/build/openSUSE:Slowroll:Build:2/_result?package=000product
attempts: 1
children: []
created: 2024-07-04T08:33:32.788609Z
delayed: 2024-07-04T08:33:32.788609Z
expires: ~
finished: 2024-07-04T08:33:33.044735Z
id: 4068028
lax: 0
notes:
  gru_id: 20378045
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/openSUSE:Slowroll:Build:2/images/x86_64/kiwi-templates-Minimal:kvm-and-xen" (in openqa) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
    read_files.sh failed for openSUSE:Slowroll:Build:2 in enviroment openSUSE:Slowroll:Build:2
retried: ~
retries: 0
started: 2024-07-04T08:33:32.792054Z
state: failed
task: obs_rsync_run
time: 2024-07-04T11:25:28.671226Z
worker: 2253
Acceptance criteria
- AC1: We don't receive those e-mails anymore (unless there is really an actionable problem)
- AC2: Errors are visible on the web UI pages under "OBS Sync"
Suggestions
- Why do URLs get configured when they are not present yet?
- Can we find out who added this? (what was added at all?)
- Interview that person and find out the use-case and why this was done
- Make the issue non-critical so we don't receive mails about failing minion jobs
- See #112871 for a ticket about changing the error handling and suppressing these errors
- Check https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openqa-trigger-from-ibs.sls?ref_type=heads
- See also https://gitlab.suse.de/openqa/openqa-trigger-from-ibs-plugin/-/tree/master/xml
- Show errors in the OBSRsync UI instead of the minion dashboard (maybe that's already the case, what is the "Last failure" column on https://openqa.opensuse.org/admin/obs_rsync for?)
- So maybe don't consider those Minion jobs failed? But that would mean they'd be cleaned up quite quickly, so the links in the "Last failure" column would not be very useful. So maybe the relevant error message/logs should be saved elsewhere?
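One way to act on the last two suggestions would be to separate "source directory does not exist yet" failures from everything else, based on the rsync message quoted in the failed jobs above. The following Python sketch only illustrates that idea; it is not how openQA or the OBSRsync plugin handles errors, and `MISSING_SOURCE` and `classify_failure` are hypothetical names.

```python
import re

# Matches the rsync sender error seen in the failed obs_rsync_run jobs:
# the configured source directory is not present on the server.
MISSING_SOURCE = re.compile(
    r'rsync: \[sender\] change_dir "(?P<path>[^"]+)" .*'
    r"failed: No such file or directory \(2\)"
)


def classify_failure(message):
    """Return ('missing_source', path) for a missing source dir, else ('other', None)."""
    match = MISSING_SOURCE.search(message or "")
    if match:
        return "missing_source", match.group("path")
    return "other", None
```

A caller could then treat "missing_source" results as non-critical while still recording the path and message for the "Last failure" column. Jobs with an empty message, as in #112871, fall into "other".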
Rollback steps
Updated by nicksinger 6 months ago
- Copied from action #155743: OBSRSync fails to sync openSUSE:Factory:PowerPC:ToTest (was: WARNINGs: failed is 452.00 in Munin - minion Minion Jobs on o3) added
Updated by nicksinger 6 months ago
- Copied from deleted (action #155743: OBSRSync fails to sync openSUSE:Factory:PowerPC:ToTest (was: WARNINGs: failed is 452.00 in Munin - minion Minion Jobs on o3))
Updated by nicksinger 6 months ago
- Related to action #155743: OBSRSync fails to sync openSUSE:Factory:PowerPC:ToTest (was: WARNINGs: failed is 452.00 in Munin - minion Minion Jobs on o3) added
Updated by tinita 6 months ago
- Related to action #163067: [alert] Munin - minion Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed - opensuse.org :: openqa.opensuse.org added
Updated by nicksinger 6 months ago
- Description updated (diff)
Some examples from OSD as well:
---
args:
- alias: SUSE:SLFO:Products:SL-Micro:6.1:ToTest
attempts: 1
children: []
created: 2024-07-05T07:00:35.834961Z
delayed: 2024-07-05T07:00:35.834961Z
expires: ~
finished: 2024-07-05T07:00:36.373966Z
id: 12004122
lax: 0
notes:
  gru_id: 37809476
parents: []
priority: 0
queue: default
result:
  code: 256
  message: ''
retried: ~
retries: 0
started: 2024-07-05T07:00:35.837364Z
state: failed
task: obs_rsync_update_builds_text
time: 2024-07-05T11:43:17.137624Z
worker: 1725
---
args:
- project: SUSE:ALP:Source:Standard:1.0:Staging:L
  url: https://api.suse.de/build/SUSE:ALP:Source:Standard:1.0:Staging:L/_result?package=000product
attempts: 1
children: []
created: 2024-07-05T09:31:36.042208Z
delayed: 2024-07-05T09:31:36.042208Z
expires: ~
finished: 2024-07-05T09:31:37.036903Z
id: 12007191
lax: 0
notes:
  gru_id: 37811813
  project_lock: 1
parents: []
priority: 100
queue: default
result:
  code: 256
  message: |-
    rsync: [sender] change_dir "/SUSE:/ALP:/Source:/Standard:/1.0:/Staging:/L/images/repo/SL-Micro*/repodata" (in repos) failed: No such file or directory (2)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1835) [Receiver=3.2.3]
    read_files.sh failed for SUSE:ALP:Source:Standard:1.0:Staging:L in enviroment SUSE:ALP:Source:Standard:1.0:Staging:L
retried: ~
retries: 0
started: 2024-07-05T09:31:36.045513Z
state: failed
task: obs_rsync_run
time: 2024-07-05T11:43:17.137624Z
worker: 1725
Updated by nicksinger 6 months ago
- Related to action #112871: obs_rsync_run Minion tasks fail with no error message size:M added
Updated by livdywan 5 months ago
- Subject changed from OBSRSync regularily fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") to OBSRSync regularily fails minion jobs - nobody cares, tools gets alerted (e.g. "Munin - minion Minion Jobs") size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 5 months ago
- Due date set to 2024-08-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 5 months ago
Posted on #proj-agama about the wrong paths:
https://suse.slack.com/archives/C02TLF25571/p1721919187484699
Updated by dheidler 4 months ago
- Status changed from Workable to In Progress
The openqa-trigger-from-obs config was updated in https://github.com/os-autoinst/openqa-trigger-from-obs/commit/4f683924d171c4d3d6a7f1fce9b9f8db65c2ba00
Updated the checkout on ariel accordingly:
dheidler@ariel:/opt/openqa-trigger-from-obs> sudo rm -r systemsmanagement\:Agama\:Staging/
The systemsmanagement:Agama:Devel files seem to have been already updated.
Updated by dheidler 4 months ago
- Status changed from In Progress to Resolved
I'm closing this ticket now because technically both ACs are fulfilled.
I fixed the concrete problem but made no further changes in openQA, because it is already possible to access the error in the minion job via the obs_sync page ("Last failure" column).
Updated by okurz about 23 hours ago
- Status changed from Resolved to Feedback