action #118969
closed
[alert] web UI: Too many Minion job failures alert
Added by jbaier_cz about 2 years ago.
Updated about 2 years ago.
Category: Feature requests
Description
E-mail alert about job failures; the job in question is download_assets, see https://openqa.suse.de/minion/jobs?id=5564865
result: 'Downloading "http://carwos-runner.qa.suse.de/gitlab/Bogdan.Lezhepekov/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
failed with: Download of "/var/lib/openqa/share/factory/hdd/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
failed: 521 Connect timeout'
As of now, the address in question is really not reachable:
$ ping -c 1 carwos-runner.qa.suse.de
PING carwos-runner.qa.suse.de (10.161.50.2) 56(84) bytes of data.
--- carwos-runner.qa.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
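The ping above only tests ICMP, while the failing download needs a TCP connection to the web server. A minimal sketch of a check closer to what the download does, assuming the asset is served on port 80 (suggested by the http:// URL) and using a hypothetical helper name is_reachable:

```python
import socket


def is_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Name resolution failures, refused connections and timeouts are all
    reported as "not reachable" (they are all subclasses of OSError).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (performs real network access, so it is left commented out):
# print(is_reachable("carwos-runner.qa.suse.de"))
```

A False result here only says that the host did not accept a connection within the timeout; it does not distinguish between a host that is down and one that is firewalled.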
I already deleted a bunch of them, but there are more in the inactive queue, so I expect some more failures in the near future.
Related issues
1 (1 open — 0 closed)
- Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
- Category changed from Organisational to Feature requests
- Priority changed from Normal to Urgent
First step: Disable the alert and notify the owner of the affected machine. Second step: Those error messages would be good as "reason" within the according jobs that should incomplete. There should not be an alert alerting us as the test owners that specified the asset download URL need to handle that.
- Status changed from New to In Progress
> Those error messages would be good as "reason" within the according jobs that should incomplete.
The jobs already end up incomplete. These GRU downloads predate the introduction of the reason field, and therefore the error message is added as a job module result here (e.g. https://openqa.suse.de/tests/9746552). So should I change that to use the reason field instead? It would be more consistent: we wouldn't have the rather misleading reason "no test modules scheduled/uploaded", and we could search for those jobs via DB queries.
> There should not be an alert alerting us as the test owners that specified the asset download URL need to handle that.
I suppose the easiest way to prevent that is to avoid considering these Minion jobs failed.
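The idea could be sketched as follows. This is not the openQA implementation (which is written in Perl); it is only an illustration in Python of the decision being discussed. The marker list, the Outcome type, the function name and the reason prefix are all hypothetical; only the Minion states "finished" and "failed" are taken from Minion itself.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases marking errors caused by the asset URL or the remote
# host rather than by openQA itself; the real list would need to be agreed on.
USER_SIDE_MARKERS = ("Connect timeout", "Connection refused")


@dataclass
class Outcome:
    minion_state: str          # "finished" or "failed"
    job_reason: Optional[str]  # text to record as the reason of the openQA job


def classify_download_error(error: Optional[str]) -> Outcome:
    """Decide how a download result should be reported."""
    if error is None:
        # Download succeeded: nothing to report.
        return Outcome("finished", None)
    if any(marker in error for marker in USER_SIDE_MARKERS):
        # The test owner has to fix the URL or the host: do not count this
        # as a Minion failure, but pass the message on as the job reason.
        return Outcome("finished", f"asset download failed: {error}")
    # Anything else is treated as a real failure that should still alert.
    return Outcome("failed", None)
```

With the error from this ticket ("... failed: 521 Connect timeout"), the sketch reports the Minion job as finished and carries the message into the reason of the incomplete openQA job, so the alert would only count failures that are not attributable to the asset URL.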
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
- Resumed the alert
- PR has been merged and deployed