coordination #99831

Updated by okurz 7 months ago

## Motivation
We often get alerts about "Too many Minion job failures". Some of the failed jobs have the result `'Job terminated unexpectedly (exit code: 0, signal: 15)'`:
- This problem is independent of the task, but it is seen much more often for tasks of which we spawn many jobs (e.g. `finalize_job_result`) and for tasks with possibly long-running jobs (e.g. `limit_assets`).
- The error most likely just means that the Minion worker was restarted, as signal 15 is `SIGTERM`.

## Acceptance criteria
* **AC1:** All our Minion jobs have a `SIGTERM` handler that decides how to shut down cleanly within a reasonable time
* **AC2:** Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures over multiple deployments
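
One way to check AC2 could be to count the failed jobs whose result contains "Job terminated unexpectedly", grouped by task. The following is a minimal Python sketch, not part of openQA. It assumes the job records are already available as a list of dictionaries with `task`, `state` and `result` keys; that record shape and the function name `count_unexpected_terminations` are assumptions made for illustration.

```python
from collections import Counter

# Substring taken from the failure result quoted in the motivation.
MARKER = "Job terminated unexpectedly"


def count_unexpected_terminations(jobs):
    """Count failed jobs per task whose result mentions MARKER.

    `jobs` is assumed to be an iterable of dicts with the keys
    "task", "state" and "result" (hypothetical record shape).
    """
    counts = Counter()
    for job in jobs:
        result = str(job.get("result", ""))
        if job.get("state") == "failed" and MARKER in result:
            counts[job.get("task", "unknown")] += 1
    return dict(counts)
```

AC2 would be met if this returns an empty dictionary for the job lists collected after each of several deployments.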

## Suggestions
* Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.
* Implement a `SIGTERM` handler for each Minion job
* Test on O3 and OSD, either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, and then check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
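
The handler suggested above could follow a cooperative shutdown pattern: the handler only records that `SIGTERM` arrived, and the job checks that flag between units of work so it can stop cleanly and report that it was interrupted. Minion jobs are written in Perl, so the Python sketch below only illustrates the pattern and is not openQA or Minion code. The names `install_sigterm_handler` and `run_job` and the "interrupted" status are hypothetical.

```python
import signal
import threading


def install_sigterm_handler(stop_event):
    """Make SIGTERM set `stop_event` instead of terminating the process.

    Must be called from the main thread. Returns the previous handler
    so that the caller can restore it later.
    """
    def handler(signum, frame):
        # Keep the handler tiny: only record the shutdown request.
        stop_event.set()

    return signal.signal(signal.SIGTERM, handler)


def run_job(items, process, stop_event):
    """Process items one by one and stop between items on request.

    Returns a (status, done) tuple where status is "finished" or
    "interrupted" and done lists the items that were fully processed.
    """
    done = []
    for item in items:
        if stop_event.is_set():
            # A caller could retrigger the job or mark it as acceptable here.
            return "interrupted", done
        process(item)
        done.append(item)
    return "finished", done
```

A job would create a `threading.Event`, pass it to `install_sigterm_handler` once at startup and then to `run_job`. Checking the flag only between items bounds the shutdown time by the duration of a single item, which matters for long-running tasks such as `limit_assets`.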
