action #118969
closed
[alert] web UI: Too many Minion job failures alert
0%
Description
E-mail alert about job failures; the job in question is download_assets, see https://openqa.suse.de/minion/jobs?id=5564865
result: 'Downloading "http://carwos-runner.qa.suse.de/gitlab/Bogdan.Lezhepekov/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
failed with: Download of "/var/lib/openqa/share/factory/hdd/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
failed: 521 Connect timeout'
As of now, the address in question is indeed not reachable:
$ ping -c 1 carwos-runner.qa.suse.de
PING carwos-runner.qa.suse.de (10.161.50.2) 56(84) bytes of data.
--- carwos-runner.qa.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
I already deleted a bunch of them, but there are more in the inactive queue, so I expect some more failures in the near future.
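The reachability check above uses ICMP, while the failing download reports a connect timeout. A minimal sketch of a TCP-level check is shown below; the function name `is_reachable`, the default port 80, and the 5 second timeout are assumptions for illustration and are not part of openQA.

```python
import socket


def is_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: port and timeout defaults are assumptions.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `is_reachable("carwos-runner.qa.suse.de")` would be expected to return `False` while the host is down, which matches the connect timeout seen by the download job more closely than a ping does.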
Updated by jbaier_cz about 2 years ago
- Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Updated by okurz about 2 years ago
- Category changed from Organisational to Feature requests
- Priority changed from Normal to Urgent
First step: Disable the alert and notify the corresponding machine owner. Second step: Those error messages would be good as "reason" within the according jobs that should incomplete. There should not be an alert alerting us, as the test owners that specified the asset download URL need to handle that.
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
I left a message in the chat. The alert "Incomplete jobs (not restarted) of last 24h alert" is likely related as well.
Updated by mkittler about 2 years ago
Those error messages would be good as "reason" within the according jobs that should incomplete.
The jobs already end up incomplete. These GRU downloads predate the introduction of the reason field and therefore the error message is added as a job module result here (e.g. https://openqa.suse.de/tests/9746552). So should I change that to use the reason field instead? It would be more consistent: we wouldn't have the rather misleading reason "no test modules scheduled/uploaded", and we could search for those jobs via DB queries.
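To illustrate the point about DB queries, here is a minimal sketch of how jobs could be found once the download error is stored in the reason field. It uses an in-memory SQLite table as a stand-in for the real database; the reduced `jobs` schema, the `asset failure:` reason prefix, and all rows are made up for illustration.

```python
import sqlite3


def find_jobs_by_reason(conn: sqlite3.Connection, pattern: str):
    """Return (id, reason) of incomplete jobs whose reason matches a LIKE pattern."""
    cur = conn.execute(
        "SELECT id, reason FROM jobs "
        "WHERE result = 'incomplete' AND reason LIKE ? ORDER BY id",
        (pattern,),
    )
    return cur.fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny stand-in table; schema and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, reason TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [
            (1, "incomplete", "asset failure: 521 Connect timeout"),
            (2, "passed", None),
            (3, "incomplete", "no test modules scheduled/uploaded"),
            (4, "incomplete", "asset failure: 404 Not Found"),
        ],
    )
    return conn
```

With the sample rows, `find_jobs_by_reason(demo_connection(), "asset failure:%")` selects only the two jobs whose reason carries the download error, which is not possible while the error lives in a job module result.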
There should not be an alert alerting us, as the test owners that specified the asset download URL need to handle that.
I suppose the easiest way to prevent that is to avoid considering these Minion jobs failed.
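As a rough sketch of that idea (not necessarily what the linked PR does): a download error caused by the user-specified URL is recorded on the outcome, but the background task itself counts as finished, so it would not feed the failure alert. The names `DownloadError`, `TaskOutcome`, and `run_download_task` are hypothetical and written in Python for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class DownloadError(Exception):
    """Hypothetical error for problems with the user-specified asset URL."""


@dataclass
class TaskOutcome:
    state: str                   # "finished" or "failed"
    error: Optional[str] = None  # message to pass on to the openQA job


def run_download_task(download: Callable[[], None]) -> TaskOutcome:
    """Run a download and classify the result of the background task."""
    try:
        download()
    except DownloadError as exc:
        # Unreachable or wrong URL: the test owner has to act, so the
        # task is not counted as failed; the error is kept for the job.
        return TaskOutcome("finished", str(exc))
    except Exception as exc:
        # Anything else is unexpected and should still raise the alert.
        return TaskOutcome("failed", str(exc))
    return TaskOutcome("finished")
```

The design choice here is to keep the "failed" state for unexpected errors only, so the alert stays meaningful for problems the tool maintainers actually need to look into.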
Updated by mkittler about 2 years ago
- Cleaned the Minion dashboard
- PR: https://github.com/os-autoinst/openQA/pull/4844
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
- Resumed the alert
- PR has been merged and deployed