coordination #99831
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
[epic] Better handle minion tasks failing with "Job terminated unexpectedly"
55% done
Description
Motivation
We often get alerts about "Too many Minion job failures" where the failed jobs have the result 'Job terminated unexpectedly (exit code: 0, signal: 15)':
- This problem is independent of the task but is of course seen much more often for tasks we spawn many jobs of (e.g. finalize_job_result tasks) and for tasks with possibly long-running jobs (e.g. limit_assets).
- The error presumably just means the Minion worker was restarted, as signal 15 is SIGTERM.
Acceptance criteria
- AC1: All our Minion jobs have a SIGTERM handler deciding how to shut down cleanly within a reasonable time
- AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" over multiple deployments
Suggestions
- Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.
- Implement a SIGTERM handler for each Minion job (see the sketch below this list)
- Test on o3 and osd, either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments and checking the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
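A minimal sketch of such a handler, assuming a made-up long-running cleanup task registered via Minion's add_task; the task name limit_something and the helpers find_items_to_clean()/cleanup_one() are placeholders, not actual openQA code:

# Sketch only: a Minion task that reacts to SIGTERM by stopping at a safe point.
# $app is assumed to be a Mojolicious application with the Minion plugin loaded.
$app->minion->add_task(
    limit_something => sub {
        my ($job, @args) = @_;

        # Remember that we were asked to terminate instead of dying mid-work
        my $stopping = 0;
        local $SIG{TERM} = sub { $stopping = 1 };

        for my $item (@{find_items_to_clean()}) {    # placeholder helper
            if ($stopping) {
                # Leave a note so the interruption is visible in the Minion dashboard
                $job->note(interrupted => 'received SIGTERM, stopped gracefully');
                return $job->finish('Stopped gracefully on SIGTERM');
            }
            cleanup_one($item);                       # placeholder helper
        }
        $job->finish('done');
    });

Whether a task should stop early like this or simply ignore the signal and finish depends on how long it typically runs.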
Updated by okurz about 3 years ago
Discussed on 2021-11-10 because we again found some Minion jobs failing with the above symptoms. Whenever processes are stopped or restarted it's likely that we hit a job_finalize_result job because we have many openQA jobs. These Minion jobs likely only take a few seconds, so we should ensure that we have explicit TERM signal handling to either just finish the task or stop gracefully within a reasonable time, i.e. a few seconds. Because the main Minion job handler tries to bring down Minion jobs gracefully, we should be ok to just ignore the TERM signal.
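For such short-lived jobs, ignoring the signal for the duration of the job could be as simple as the following sketch (not the actual openQA implementation; finalize_results_for() is a placeholder):

# Sketch: a short task that simply ignores SIGTERM while it runs.
$app->minion->add_task(
    finalize_job_results => sub {
        my ($job, @args) = @_;
        # Ignore SIGTERM just for the duration of this job; "local" restores the
        # previous handler automatically when the sub returns
        local $SIG{TERM} = 'IGNORE';
        finalize_results_for($job, @args);    # placeholder for the actual work
    });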
Updated by mkittler about 3 years ago
Note that the default "KillMode" of systemd is control-group, so SIGTERM is sent to all processes (see https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=). I suppose for a graceful termination of Minion jobs we would need to set it to mixed and possibly increase the time until SIGKILL is sent. Then the remaining timed-out jobs should be distinguishable as they've received SIGKILL (and not SIGTERM). Possibly the Minion framework could still be changed to make such cancelled jobs better distinguishable (and to restart such jobs automatically depending on some setting of the job).
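If we go that route, a systemd drop-in for the gru service could look like the following; the values are only examples to illustrate the idea, nothing that has been tested:

# /etc/systemd/system/openqa-gru.service.d/override.conf (example values only)
[Service]
# Deliver SIGTERM only to the main process so it can wind down its job
# processes itself; remaining processes only get SIGKILL on the final kill
KillMode=mixed
# Give running Minion jobs more time before systemd escalates to SIGKILL
TimeoutStopSec=120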
Updated by okurz about 3 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Target version changed from future to Ready
As we again have multiple Minion job alerts just after deploying OSD, I am adding this to the backlog.
Updated by livdywan about 3 years ago
Indeed this is from this morning:
---
args: []
attempts: 1
children: []
created: 2021-11-30T23:00:05.58353Z
delayed: 2021-11-30T23:00:05.58353Z
expires: 2021-12-02T23:00:05.58353Z
finished: 2021-12-01T05:23:05.67069Z
id: 3538496
lax: 0
notes:
gru_id: 30652445
parents: []
priority: 5
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2021-11-30T23:00:05.58784Z
state: failed
task: limit_results_and_logs
time: 2021-12-01T10:16:52.2364Z
worker: 575
> sudo grep --color=always limit_results_and_logs /var/log/apache2/error_log /var/log/apache2/error_log /var/log/openqa{,_gru}
/var/log/openqa_gru:[2021-12-01T00:00:05.597881+01:00] [debug] Process 31351 is performing job "3538496" with task "limit_results_and_logs"
Updated by okurz about 3 years ago
- Tracker changed from action to coordination
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" to [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
- Description updated (diff)
- Category set to Feature requests
Updated by okurz about 3 years ago
- Status changed from New to Blocked
- Assignee set to okurz
Updated by okurz almost 3 years ago
- Status changed from Blocked to Feedback
With the one subtask resolved, where mkittler also covered other jobs, we can now check the situation after the next OSD deployments and see whether any Minion jobs are left with the symptom of "Job terminated unexpectedly". If that is all solved, we can go to a daily deployment or an even higher frequency :)
Updated by mkittler almost 3 years ago
I was looking at the archiving code anyway (for #104136), so here is a PR for the archiving task: https://github.com/os-autoinst/openQA/pull/4415
Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
After another deployment https://openqa.suse.de/minion/jobs?state=failed actually shows no entries. This is great!
Updated by mkittler almost 3 years ago
- Status changed from Resolved to Feedback
Today I looked into Minion jobs and found further instances, e.g. one archiving job and many finalize jobs. I don't think this is solved until all jobs use the signal guard (also see AC1).
Updated by okurz almost 3 years ago
- Status changed from Feedback to Blocked
ok, fine. Created a specific ticket #107533 for "job_finalize_result"
Updated by okurz over 2 years ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Updated by mkittler over 2 years ago
Remaining tasks which are still missing the signal handler:
1. OpenQA::Task::Asset::Download
2. OpenQA::Task::Iso::Schedule
3. OpenQA::Task::Needle::Delete
4. OpenQA::Task::Needle::Save
Maybe 3. and 4. aren't that important because concluding the Git operation shouldn't take long (and therefore shouldn't exceed the timeout anyway). At least continuing those shouldn't be worse than trying to clean up what has already been changed so far. We could of course skip the cleanup and hard-reset the Git repository as the first step in these tasks (so it wouldn't matter if there are half-done changes). That would also help with https://progress.opensuse.org/issues/70774.
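A rough sketch of that idea, with a hypothetical helper the needle tasks could call before touching the checkout (name and error handling are made up):

# Sketch: bring the needles checkout into a clean state at the start of a task,
# so leftovers from a previously interrupted job don't matter and no cleanup is
# needed when the task itself gets terminated.
sub ensure_clean_checkout {
    my ($git_dir) = @_;
    # Discard half-done, tracked changes from an interrupted previous run
    system('git', '-C', $git_dir, 'reset', '--hard', 'HEAD') == 0
      or die "Unable to reset Git checkout in $git_dir";
    # Remove untracked leftovers (e.g. partially written files) as well
    system('git', '-C', $git_dir, 'clean', '-d', '--force') == 0
      or die "Unable to clean Git checkout in $git_dir";
}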
Updated by mkittler over 2 years ago
- Status changed from Workable to Blocked
Blocked by newly created subtasks.
Updated by mkittler over 2 years ago
There's still a small number of finalize_job_results jobs (and even one archive_job_results job) failing with "Job terminated unexpectedly". So I suppose the small window before the signal handler is set up is in fact problematic. However, it likely makes more sense to work on the other subtasks of the parent epic as those problems produce way more failed jobs right now.
Updated by okurz over 2 years ago
- Status changed from Blocked to New
- Assignee deleted (mkittler)
- Target version changed from Ready to future
Still blocked by subtasks, but none of them are currently in the backlog, hence I am removing this ticket from the backlog as well.