coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly" - openQA Project (public) - openSUSE Project Management Tool

coordination #99831

open

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

[epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Added by okurz over 3 years ago. Updated about 20 hours ago.

Status: New
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - Tools - Next
Start date: 2021-12-02
Due date:
% Done: 70%
Estimated time: (Total: 0.00 h)

Description

Motivation

We often get alerts about "Too many Minion job failures". Some of these failures have the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This problem is independent of the task, but is of course seen much more often for tasks of which we spawn many jobs (e.g. finalize_job_results) and tasks with possibly long-running jobs (e.g. limit_assets). I suppose the error just means the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.
In short, for "Too many Minion job failures" alerts with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)':
- The problem is seen mostly for tasks of which we spawn many jobs (e.g. finalize_job_results) and tasks with possibly long-running jobs (e.g. limit_assets).
- Signal 15 is SIGTERM, so the Minion worker was most likely restarted.
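
To make the reading of the result text concrete, here is a minimal Python sketch that extracts the exit code and signal number from the message quoted above and maps the number to a signal name. The function name `parse_termination` and the regular expression are assumptions made for illustration; they are not part of Minion or openQA.

```python
import re
import signal

# Pattern for the result text quoted in this ticket (an assumption about
# the exact wording; adjust it if the message differs).
_RESULT_RE = re.compile(
    r"Job terminated unexpectedly \(exit code: (\d+), signal: (\d+)\)"
)


def parse_termination(result):
    """Return (exit_code, signal_name), or None if the text does not match."""
    match = _RESULT_RE.search(result)
    if match is None:
        return None
    exit_code, signum = int(match.group(1)), int(match.group(2))
    try:
        # signal.Signals maps a number such as 15 to its symbolic name.
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform (e.g. 0).
        name = f"signal {signum}"
    return exit_code, name
```

A failure classified as SIGTERM this way would be a candidate for being ignored or retriggered rather than counted towards the alert.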

Acceptance criteria

AC1: All our Minion jobs have a SIGTERM handler that decides how to shut down cleanly within a reasonable time
AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures across multiple deployments

Suggestions

Since such tasks are either not very important or triggered periodically, review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.
Implement a SIGTERM handler for each Minion job.
Test on O3 and OSD, either by manually restarting openqa-gru multiple times or by waiting for multiple deployments, then checking the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
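
The SIGTERM handling suggested above could follow the general pattern sketched below. This is a Python illustration of the idea only, not Minion's Perl API: `Terminated`, `run_batches`, `process_batch` and the "retry"/"finished" return values are hypothetical names. The handler only sets a flag, and the job checks that flag between units of work so it can stop cleanly instead of being killed mid-work.

```python
import signal


class Terminated:
    """Flag object set by the SIGTERM handler (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only record the request; the job decides when it is safe to stop.
        self.requested = True


def run_batches(batches, process_batch):
    """Process batches until done or until SIGTERM is received.

    Returns ("finished", n) or ("retry", n), where n is the number of
    batches completed. A real job would re-enqueue itself on "retry".
    """
    term = Terminated()
    previous = signal.signal(signal.SIGTERM, term.handler)
    done = 0
    try:
        for batch in batches:
            if term.requested:
                # Stop at a safe point between two units of work.
                return ("retry", done)
            process_batch(batch)
            done += 1
        return ("finished", done)
    finally:
        # Restore whatever handler was installed before the job ran.
        signal.signal(signal.SIGTERM, previous)
```

Whether a stopped job should be retried, marked as passed or simply ignored is the per-task decision this epic asks for; the sketch only shows how a job can notice the signal and return in a controlled way.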

Subtasks 10 (3 open — 7 closed)

action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M

Resolved

mkittler

2021-12-02

action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M

Resolved

okurz

action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M

Resolved