coordination #99831

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

[epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Added by okurz about 2 months ago. Updated 1 day ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-12-02
Due date:
2021-12-17
% Done:

0%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Motivation

We often have alerts about "Too many Minion job failures", and some of the failed jobs show the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This problem is independent of the task, but it is of course seen much more often for tasks we spawn many jobs of (e.g. finalize_job_result) and for tasks with potentially long-running jobs (e.g. limit_assets). The error presumably just means that the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.

Acceptance criteria

  • AC1: All our Minion jobs have a SIGTERM handler deciding how to shut down cleanly within a reasonable time
  • AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures over multiple deployments

Suggestions

  • Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.
  • Implement a SIGTERM handler for each Minion job, e.g. along the lines of the sketch after this list
  • Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the result of multiple deployments, then check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
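
A minimal sketch of what such a SIGTERM handler could look like in a Minion task definition; this is not openQA's actual code: $app stands for the Mojolicious application object, and the task name as well as the do_one_chunk() helper are made up for illustration.

# Sketch only: a Minion task that remembers a received SIGTERM and stops at a
# safe point instead of dying mid-way with "Job terminated unexpectedly".
$app->minion->add_task(
    example_cleanup_task => sub {
        my ($job, @chunks) = @_;

        my $term_received = 0;
        # "local" limits the handler to this job; the flag is checked between chunks.
        local $SIG{TERM} = sub { $term_received = 1 };

        for my $chunk (@chunks) {
            last if $term_received;    # shut down cleanly within a reasonable time
            do_one_chunk($chunk);      # hypothetical unit of work
        }

        # Finish explicitly so the job does not end up in the "failed" state.
        $job->finish($term_received ? 'stopped early on SIGTERM' : 'done');
    });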

Subtasks

action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M (In Progress, assignee: mkittler)

History

#1 Updated by okurz 24 days ago

Discussed on 2021-11-10 as we currently again found some minion jobs failing with the above symptoms. Whenever processes are stopped or restarted it is likely that we hit a finalize_job_result job because we have many openQA jobs. These minion jobs likely only take some seconds, so we should ensure we have explicit TERM signal handling to either just finish the task or stop gracefully within a reasonable time, i.e. within some seconds. Because the main Minion job handler tries to bring down minion jobs gracefully, we should be ok to just ignore the TERM signal.
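
A minimal sketch of the "just ignore the TERM signal" variant for such a short-running task; the task body is only a placeholder and this is not openQA's actual implementation.

# Sketch only: a short task that ignores SIGTERM for its few seconds of runtime
# so a service restart does not turn it into "Job terminated unexpectedly".
$app->minion->add_task(
    finalize_job_result => sub {
        my ($job, @args) = @_;
        # "local" restores the previous handler once the job subroutine returns.
        local $SIG{TERM} = 'IGNORE';
        # ... the actual (short) finalization work would happen here ...
        $job->finish;
    });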

#2 Updated by mkittler 24 days ago

Note that systemd's default "KillMode" is control-group, so SIGTERM is sent to all processes (see https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=). I suppose for a graceful termination of Minion jobs we would need to set it to mixed and possibly increase the time until SIGKILL is sent. Then the remaining timed-out jobs should be distinguishable as they have received SIGKILL (and not SIGTERM). Possibly the Minion framework could still be changed to make such cancelled jobs better distinguishable (and to restart such jobs automatically depending on some setting of the job).
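
A sketch of that systemd tweak as a drop-in for openqa-gru.service (assuming that is the unit running the Minion workers); the drop-in file name and the timeout value are only examples.

# /etc/systemd/system/openqa-gru.service.d/killmode.conf (hypothetical drop-in)
[Service]
# "mixed" sends SIGTERM only to the main process; the remaining processes of the
# control group only receive SIGKILL once the stop timeout expires.
KillMode=mixed
# Give running Minion jobs more time to finish before SIGKILL is sent.
TimeoutStopSec=120

After adding such a drop-in, systemctl daemon-reload and a restart of the service would be needed for it to take effect.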

#3 Updated by okurz 3 days ago

  • Project changed from openQA Infrastructure to openQA Project
  • Target version changed from future to Ready

As we again have multiple minion job alerts just after deploying OSD, I am adding this to the backlog.

#4 Updated by cdywan 3 days ago

Indeed this is from this morning:

---
args: []
attempts: 1
children: []
created: 2021-11-30T23:00:05.58353Z
delayed: 2021-11-30T23:00:05.58353Z
expires: 2021-12-02T23:00:05.58353Z
finished: 2021-12-01T05:23:05.67069Z
id: 3538496
lax: 0
notes:
  gru_id: 30652445
parents: []
priority: 5
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2021-11-30T23:00:05.58784Z
state: failed
task: limit_results_and_logs
time: 2021-12-01T10:16:52.2364Z
worker: 575
> sudo grep --color=always limit_results_and_logs /var/log/apache2/error_log /var/log/apache2/error_log /var/log/openqa{,_gru}
/var/log/openqa_gru:[2021-12-01T00:00:05.597881+01:00] [debug] Process 31351 is performing job "3538496" with task "limit_results_and_logs"
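
For checking such failures from the command line instead of the dashboard, the upstream Minion job command can be invoked through the openQA application script; the script path and the geekotest user are assumptions here and may differ per installation.

# List failed jobs of the limit_results_and_logs task:
sudo -u geekotest /usr/share/openqa/script/openqa minion job -S failed -t limit_results_and_logs
# Retry the failed job shown above by its id:
sudo -u geekotest /usr/share/openqa/script/openqa minion job -R 3538496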

#5 Updated by okurz 2 days ago

  • Tracker changed from action to coordination
  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" to [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
  • Description updated (diff)
  • Category set to Feature requests

#6 Updated by okurz 2 days ago

  • Status changed from New to Blocked
  • Assignee set to okurz
