coordination #99831

open

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

[epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Added by okurz about 3 years ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version: QA (public, currently private due to #173521) - future
Start date: 2021-12-02
Due date:
% Done: 55%
Estimated time: (Total: 0.00 h)

Description

Motivation

We often get alerts about "Too many Minion job failures". Some of the failed jobs have the result 'Job terminated unexpectedly (exit code: 0, signal: 15)':
- The problem is independent of the task, but it is seen much more often on tasks for which we spawn many jobs (e.g. finalize_job_results) and on tasks with possibly long-running jobs (e.g. limit_assets).
- Signal 15 is SIGTERM, so the error most likely just means that the Minion worker was restarted.

Acceptance criteria

  • AC1: All our Minion jobs have a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" over multiple deployments

Suggestions

  • Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (may need an upstream feature) or handle them differently, e.g. retrigger automatically.
  • Implement a SIGTERM handler for each Minion job
  • Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments and checking the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
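
The review suggested in the first point could end in a small decision table. The sketch below is a hypothetical illustration in Python, not part of openQA: the function `classify_failure`, the outcome names and the set `PERIODIC_TASKS` are assumptions, and which tasks actually belong in that set is exactly what the review would have to decide. It only parses the result string quoted in the motivation.

```python
import re
import signal

# Matches the result text quoted in this ticket.
TERMINATED = re.compile(
    r"Job terminated unexpectedly \(exit code: (\d+), signal: (\d+)\)"
)

# Assumption for illustration only: tasks treated as periodically triggered.
PERIODIC_TASKS = frozenset({"limit_assets", "limit_results_and_logs"})


def classify_failure(task, result):
    """Return 'ignore', 'retrigger' or 'alert' for a failed job."""
    match = TERMINATED.search(result) if isinstance(result, str) else None
    if match is None:
        return "alert"  # Unrelated failure, keep alerting.
    if int(match.group(2)) != int(signal.SIGTERM):
        return "alert"  # Killed by something other than SIGTERM.
    # SIGTERM: periodic tasks will run again anyway, others get retriggered.
    return "ignore" if task in PERIODIC_TASKS else "retrigger"
```

For example, `classify_failure("limit_assets", "Job terminated unexpectedly (exit code: 0, signal: 15)")` returns `"ignore"` under these assumptions, while the same result on a task outside the set returns `"retrigger"`.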

Subtasks 9 (4 open, 5 closed)

action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M (Resolved, mkittler, 2021-12-02)
action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved, okurz)
action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M (Resolved, mkittler)
action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download (New, 2022-03-25)
action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule (New, 2022-03-25)
action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle (New, 2022-03-25)
action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)
openQA Infrastructure (public) - action #167911: Scripts CI | Failed pipeline - openqa-schedule-mm-ping-test incompletes on o3 (Rejected, 2024-10-08)
action #169144: Link to minion job from openQA job with waiting task (New)
