coordination #99831
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
[epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Description
Motivation
We often get alerts about "Too many Minion job failures". Some of the failed jobs have the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This problem is independent of the task, but it is of course seen much more often for tasks of which we spawn many jobs (e.g. finalize_job_result tasks) and for tasks with potentially long-running jobs (e.g. limit_assets). The error most likely just means that the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. by retriggering them automatically.
Alerts for "Too many Minion job failures" with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)':
- This problem is seen for tasks of which we spawn many jobs (e.g. finalize_job_result tasks) and for tasks with potentially long-running jobs (e.g. limit_assets).
- The Minion worker was restarted, as signal 15 is SIGTERM.
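To make the reading of "exit code: 0, signal: 15" concrete, here is a minimal sketch in Python (not openQA or Minion code). It assumes the two numbers come from a conventional 16-bit POSIX wait status, where the high byte is the exit code and the low 7 bits are the terminating signal. The function names decode_wait_status and describe are hypothetical.

```python
import signal


def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into (exit_code, signal_number)."""
    exit_code = status >> 8        # high byte: exit code of a normal exit
    signal_number = status & 127   # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


def describe(status):
    """Render the status in the same shape as the failure message."""
    exit_code, signal_number = decode_wait_status(status)
    # Look up the symbolic name of the signal, if the process was signalled.
    name = signal.Signals(signal_number).name if signal_number else "none"
    return f"exit code: {exit_code}, signal: {signal_number} ({name})"
```

Under that assumption, a raw status of 15 corresponds to a process that did not exit on its own but was terminated by SIGTERM, which fits a restart of the worker service.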
Acceptance criteria
- AC1: All our Minion jobs have a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures over multiple deployments
Suggestions
- Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. by retriggering them automatically.
- Implement a SIGTERM handler for each Minion job
- Test on O3 and OSD, either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, and check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
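The SIGTERM handler idea can be sketched as follows. This is a language-neutral illustration written in Python, not the openQA implementation: the class GracefulJob, its attributes and the "retry"/"finished" return values are all hypothetical. It shows one common pattern, assuming a job can be split into small work items: the handler only sets a flag, the main loop checks the flag between items, and any unfinished work is reported back so the caller can re-enqueue it instead of recording a failure.

```python
import signal


class GracefulJob:
    """Hypothetical long-running job that stops cleanly on SIGTERM."""

    def __init__(self, items):
        self.pending = list(items)   # work items not yet processed
        self.done = []               # work items completed so far
        self.stop_requested = False

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the actual shutdown happens in the main loop.
        self.stop_requested = True

    def run(self, process):
        # Install the handler for the duration of the job (main thread only).
        previous = signal.signal(signal.SIGTERM, self._on_sigterm)
        try:
            while self.pending and not self.stop_requested:
                item = self.pending.pop(0)
                process(item)            # the current item always completes
                self.done.append(item)
        finally:
            # Restore whatever handler was active before the job started.
            signal.signal(signal.SIGTERM, previous)
        # "retry" signals that the remaining items should be re-enqueued.
        return "retry" if self.pending else "finished"
```

With this shape the time needed to shut down is bounded by the duration of a single work item, which is what "in a reasonable time" would have to mean in practice. In Minion itself the corresponding step would presumably be to finish or retry the job from within the handler logic rather than letting the process be killed; which of the two is appropriate depends on the task, as discussed in the first suggestion.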