action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S - openQA Project (public) - openSUSE Project Management Tool


action #108983

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S

Added by mkittler about 3 years ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee: ybonatakis
Category: Feature requests
Target version: Ready
Start date: 2022-03-25
Description

Acceptance criteria

AC1: The Minion job "schedule_iso" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "schedule_iso" over multiple deployments

Suggestions

Implement a SIGTERM handler for "schedule_iso", as has already been done for other jobs (e.g. #103416)
- Considering we made this a Minion task, the whole operation might take a while, so having a SIGTERM handler makes sense. (It would not if the operation were fast anyway.)
- Since everything runs within a database transaction, we can probably just abort the transaction and try again later. Of course, once the transaction has been committed we must not retry, to avoid scheduling jobs twice.
Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments and checking the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
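
The abort-and-retry idea from the suggestions can be illustrated with a minimal sketch. This is not the openQA implementation (which is Perl on top of Minion); it is a Python stand-in under stated assumptions: `ScheduleTask`, `Interrupted`, `count_jobs` and the `jobs` table are hypothetical names, and an in-memory SQLite database stands in for the real database. The handler only records the request, the work loop aborts the transaction at a safe point, and a retry is only signalled while nothing has been committed.

```python
import signal
import sqlite3


class Interrupted(Exception):
    """Raised inside the transaction once a stop has been requested."""


class ScheduleTask:
    """Hypothetical stand-in for a long-running scheduling task."""

    def __init__(self, conn):
        self.conn = conn
        self.stop_requested = False

    def handle_sigterm(self, signum, frame):
        # Only record the request; the work loop picks a safe point to stop.
        self.stop_requested = True

    def install_handler(self):
        # signal.signal must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, job_names, after_each=None):
        """Return "finished" after a commit or "retry" after a rollback."""
        try:
            with self.conn:  # commits on success, rolls back on an exception
                for name in job_names:
                    if self.stop_requested:
                        raise Interrupted()
                    self.conn.execute(
                        "INSERT INTO jobs (name) VALUES (?)", (name,)
                    )
                    if after_each is not None:
                        after_each(name)  # hook used to simulate a signal
        except Interrupted:
            # Nothing was committed, so trying again later is safe.
            return "retry"
        # The jobs are committed now; a late SIGTERM must not cause
        # another attempt, or the jobs would be scheduled twice.
        return "finished"


def count_jobs(conn):
    """Number of rows currently stored in the hypothetical jobs table."""
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

In a real worker, `install_handler()` would be called once at the start of the task and a "retry" result would be mapped to whatever retry mechanism the job queue offers. Checking the flag between inserts keeps the shutdown delay bounded by the duration of a single insert.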

Related issues 2 (0 open — 2 closed)

Related to openQA Project (public) - action #110032: Migration group can not trigger, missing minion jobs? size:M (Resolved, kraih, 2022-05-03)
Related to openQA Project (public) - action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)

History

Updated by okurz about 3 years ago

Target version set to future

Updated by mkittler about 3 years ago

Related to action #110032: Migration group can not trigger, missing minion jobs? size:M added

Updated by tinita 8 months ago

Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added

Updated by okurz 2 months ago

Target version changed from future to Tools - Next

Updated by okurz 2 months ago

Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S
Status changed from New to Workable

Updated by tinita about 2 months ago

Target version changed from Tools - Next to Ready

Updated by ybonatakis about 2 months ago

Status changed from Workable to In Progress
Assignee set to ybonatakis

Updated by ybonatakis about 2 months ago

https://github.com/os-autoinst/openQA/pull/6381

Updated by ybonatakis about 2 months ago

Status changed from In Progress to Feedback

Setting the status to Feedback because I was not able to test it manually. If and when this is merged, we will need some time to observe the results in production 🤞🏾, which will eventually cover AC2.

#10

Updated by ybonatakis about 1 month ago

Status changed from Feedback to In Progress

looking at https://github.com/os-autoinst/openQA/pull/6381#pullrequestreview-2756165022

#11

Updated by openqa_review about 1 month ago

Due date set to 2025-04-26

Setting due date based on mean cycle time of SUSE QE Tools

#12

Updated by ybonatakis about 1 month ago

Status changed from In Progress to Feedback

ybonatakis wrote in #note-10:

looking at https://github.com/os-autoinst/openQA/pull/6381#pullrequestreview-2756165022

I passed the Minion instance down to lib/OpenQA/Schema/Result/ScheduledProducts.pm so that a retry can be performed if something is interrupted after the job transaction to the database. This also affects lib/OpenQA/WebAPI/Controller/API/V1/Iso.pm, which uses schedule_iso as well.

Waiting for reviews

#13

Updated by livdywan about 1 month ago

Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added

#14

Updated by ybonatakis about 1 month ago

Status changed from Feedback to In Progress

Back in progress to address some feedback, in particular https://github.com/os-autoinst/openQA/pull/6381#discussion_r2041570709

#15

Updated by livdywan about 1 month ago

Related to deleted (action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S)

#16

Updated by ybonatakis about 1 month ago

Status changed from In Progress to Feedback

#17

Updated by ybonatakis about 1 month ago

Merged. Keeping it in Feedback for now.

#18

Updated by ybonatakis about 1 month ago

I checked the Minion failures at https://openqa.suse.de/minion/jobs?task=schedule_iso&state=failed&queue=&note= and did find "Job terminated unexpectedly" there.

However, all of those failures occurred after the recovery from https://progress.opensuse.org/issues/181175, and the 5 jobs are shown as created around 2025-04-22T14:28:30 UTC.

We updated OSD later, at 3:11:48 PM GMT+2, and there have been no occurrences of "Job terminated unexpectedly" since. I will check again tomorrow and close the ticket if I do not see anything. I could maybe also search the database.

#19

Updated by ybonatakis about 1 month ago

Status changed from Feedback to Resolved

I checked again today and did not see any new occurrences. I am closing this one.

#20

Updated by okurz about 1 month ago

Due date deleted (~~2025-04-26~~)
