action #100503
closed coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
Identify all "finalize_job_results" failures and handle them (report ticket or fix)
Description
Motivation
We have failed Minion jobs and related alerts. If a job failed only because of a user-provided hook script, that would not be our business, but so far we do not have any such hook scripts on OSD. We would know if we did, because hook scripts have to be configured via salt. Mostly we configure the hook script for the investigation ourselves and want to be informed about problems, so we want to consider these failures after all.
Acceptance criteria
- AC1: All recent "finalize_job_results" failures are investigated and handled accordingly (ticket reported or fixed)
Suggestions
- Talk to tinita and cdywan about their recent experiences with job hook failure handling
- Look up https://openqa.suse.de/minion/jobs?task=finalize_job_results&state=failed&queue=&note= (currently 4) and go through all failures: report a ticket if the issue is new and not yet known, remove the job if a ticket already exists, or fix the issue directly
Out of scope
- result: 'Job terminated unexpectedly (exit code: 0, signal: 15)' (#99831)
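
The triage described under Suggestions could be sketched roughly as follows. This is only an illustration, not existing tooling: the `triage` function, the input format (a list of dicts with `id` and `result` keys, for example copied by hand from the Minion dashboard) and the sample data are all assumptions. It groups failed jobs by their result text, so that each distinct failure can be matched against existing tickets, and skips the result that is out of scope for this ticket.

```python
from collections import defaultdict

# Result text that this ticket treats as out of scope (tracked in #99831).
OUT_OF_SCOPE_MARKERS = (
    "Job terminated unexpectedly (exit code: 0, signal: 15)",
)


def triage(failed_jobs):
    """Group failed jobs by result text, skipping out-of-scope results.

    `failed_jobs` is a hypothetical list of dicts with "id" and "result"
    keys; this is not a format provided by Minion or openQA.
    Returns a dict mapping each distinct result text to a list of job ids.
    """
    groups = defaultdict(list)
    for job in failed_jobs:
        result = str(job.get("result", "")).strip()
        if any(marker in result for marker in OUT_OF_SCOPE_MARKERS):
            continue  # handled elsewhere, not part of this ticket
        groups[result].append(job["id"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"id": 1, "result": "Job terminated unexpectedly (exit code: 0, signal: 15)"},
        {"id": 2, "result": "example failure A"},
        {"id": 3, "result": "example failure A"},
        {"id": 4, "result": "example failure B"},
    ]
    for result, ids in triage(sample).items():
        print(f"{len(ids)} job(s) {ids}: {result}")
```

Each remaining group then corresponds to one decision: report a new ticket, remove the jobs because a ticket already exists, or fix the issue directly.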