action #100503
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
Identify all "finalize_job_results" failures and handle them (report ticket or fix)
Description
Motivation
We have failed Minion jobs and related alerts. If a job failed only because of a user-provided hook script, that would not be our business, but so far OSD has none of those; we would know, because such hooks need to be configured via salt. Mostly we configure the hook script for the investigation ourselves and want to be informed about problems, so we do want to look into these failures after all.
Acceptance criteria
- AC1: All recent "finalize_job_results" failures are investigated and handled accordingly (ticket reported or fixed)
Suggestions
- Talk to tinita and cdywan about their recent experiences with job hook failure handling
- Look up https://openqa.suse.de/minion/jobs?task=finalize_job_results&state=failed&queue=&note= (currently 4) and look into all failures: report a ticket if it is a new, not yet known issue, remove the job if a ticket already exists, or fix it directly; see the command sketch below
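The failed jobs linked above can also be listed and cleaned up directly on the host via Minion's built-in job command. This is a minimal sketch, assuming the usual openQA package layout and the geekotest service user on OSD; the script path and user name may differ on other deployments:

```
# List failed finalize_job_results jobs (same data as the dashboard URL above)
sudo -u geekotest /usr/share/openqa/script/openqa minion job -t finalize_job_results -S failed

# Dump the full information (args, result, notes) of a single job, with <id>
# taken from the listing above, to decide whether the failure is already known
sudo -u geekotest /usr/share/openqa/script/openqa minion job <id>

# Once a ticket exists or the problem is fixed, remove the failed job
sudo -u geekotest /usr/share/openqa/script/openqa minion job --remove <id>
```

Removing a job with --remove is what "remove if there is a ticket already" refers to; the entry should then also no longer count towards the "Too many Minion job failures" alert mentioned in the parent epic.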
Out of scope
- result: 'Job terminated unexpectedly (exit code: 0, signal: 15)' (#99831)
Updated by livdywan about 3 years ago
We discussed it briefly. It is not clear what the goal here is. If we want to catch failures in hook scripts, we need tickets about failures. And if we're not seeing alerts, we need to look into that. We're also not sure why this is "High" when e.g. #99741, which is about minion failures that show up as green, isn't. Also, it's unclear what "user jobs" refers to here.
Updated by okurz about 3 years ago
> cdywan wrote:
> We discussed it briefly. It is not clear what the goal here is.
Hm, the goal should have been clear from AC1: "All recent "finalize_job_results" failures are investigated and handled accordingly (ticket reported or fixed)". But let me try to rephrase in case it's not clear: there should be no failing minion jobs of type "finalize_job_results" on osd and o3 whose reason for failure is not already known and handled in other tickets.
> If we want to catch failures in hook scripts, we need tickets about failures.
Well, if we currently don't have any such failures, then this ticket is not that urgent.
> And if we're not seeing alerts, we need to look into that.
Yes, but I am not aware of any specific problems of the kind "jobs fail but no alerts".
> We're also not sure why this is "High" when e.g. #99741, which is about minion failures that show up as green, isn't.
It's mostly "High" because I thought you and tina were already working on related tasks. I am not sure which ticket you handle those hook job failures in; maybe still the o3 Leap 15.3 upgrade ticket?
Regarding #99741: there you asked specifically for alerts on o3, which I see as mostly blocked by #57239, hence #99741 is not in the backlog.
> Also, it's unclear what "user jobs" refers to here.
What "user jobs"?
Updated by tinita about 3 years ago
> okurz wrote:
> It's mostly "High" because I thought you and tina were already working on related tasks. I am not sure which ticket you handle those hook job failures in; maybe still the o3 Leap 15.3 upgrade ticket?
> Regarding #99741: there you asked specifically for alerts on o3, which I see as mostly blocked by #57239, hence #99741 is not in the backlog.
Why is #57239 blocking #99741?
#99519 is about missing comments. I investigated the problem, and one finding was that we don't catch failures in hook scripts as actual failures.
#99741 is a followup to that and to the other failure regarding /bin/sh.
>> Also, it's unclear what "user jobs" refers to here.
> What "user jobs"?
The ticket description mentions "user-provided hook script".
Updated by livdywan about 3 years ago
- Status changed from New to Resolved
- Assignee set to livdywan
There are currently two failed jobs, both failing because of #99831, and no other ones. If we do see failures after all, we'll get alerts and handle those, since we don't want to ignore these in general.