action #108983

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule

Added by mkittler over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version: QA - future
Start date: 2022-03-25
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: The Minion job "schedule_iso" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: The Minion job lists on OSD and O3 show no "Job terminated unexpectedly" failures for "schedule_iso" across multiple deployments

Suggestions

  • Implement a SIGTERM handler for "schedule_iso", as has already been done for other jobs (e.g. #103416)
    • Since we made this a Minion task, the whole operation might take a while, so I suppose having a SIGTERM handler makes sense. (It would not if the operation were fast anyway.)
    • Since everything runs within a database transaction, I suppose we can simply abort the transaction and try again later. Once the transaction has been committed, however, we must not retry, to avoid scheduling jobs twice.
  • Test on O3 and OSD, either by manually restarting openqa-gru multiple times or by waiting for multiple deployments, and then check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
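
The abort-and-retry idea from the suggestions can be illustrated with a minimal sketch. It is written in Python with SQLite purely for illustration and is not openQA's Perl code or the Minion API. The names `install_sigterm_flag`, `schedule_jobs`, and the `jobs` table are hypothetical. The assumption is that the handler only records the stop request and the task checks it at safe points before the commit.

```python
import signal
import sqlite3
import threading


def install_sigterm_flag(stop):
    """Make SIGTERM set `stop` instead of killing the process mid-transaction.

    Hypothetical helper; must be called from the main thread.
    """
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def schedule_jobs(conn, job_names, stop):
    """Insert all jobs in one transaction (hypothetical stand-in for schedule_iso).

    Returns "retry" if a stop was requested before the commit (nothing is
    written), or "finished" once the commit has gone through.
    """
    try:
        for name in job_names:
            if stop.is_set():
                # Discard the partial work so a later retry starts from a clean state.
                conn.rollback()
                return "retry"
            conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    # The jobs now exist, so a stop request from here on must not cause a retry.
    return "finished"
```

A caller would create a `threading.Event`, pass it to `install_sigterm_flag` once at startup, and re-enqueue the task only when `schedule_jobs` returns "retry". Because the rollback happens before anything is committed, retrying cannot schedule jobs twice.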

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #110032: Migration group can not trigger, missing minion jobs? size:M (Resolved, kraih, 2022-05-03)

Related to openQA Project - action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)

#1

Updated by okurz over 2 years ago

  • Target version set to future
#2

Updated by mkittler over 2 years ago

  • Related to action #110032: Migration group can not trigger, missing minion jobs? size:M added
#3

Updated by tinita about 2 months ago

  • Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added