action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S

action #108983

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S

Added by mkittler about 3 years ago. Updated about 1 month ago.

Status:

Resolved

Priority:

Normal

Assignee:

ybonatakis

Category:

Feature requests

Target version:

Ready

Start date:

2022-03-25

Description

Acceptance criteria

AC1: The minion job "schedule_iso" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
AC2: Our minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" for "schedule_iso" over multiple deployments

Suggestions

- Implement a SIGTERM handler for "schedule_iso" as has already been done for other jobs (e.g. #103416)
  - Considering we made this a Minion task, the whole operation might take a while, so I suppose having a SIGTERM handler makes sense. (It wouldn't if the operation were fast anyway.)
  - Since everything runs within a database transaction, I suppose we can simply abort the transaction and try again later. Of course, once the transaction has been committed we must not retry, to avoid scheduling jobs twice.
- Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
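
To illustrate the suggestion above, here is a minimal sketch in Python (not the Perl code used in openQA, and not the implementation from any pull request). It assumes a hypothetical `ScheduleTask` class, a hypothetical `jobs` table and an SQLite connection standing in for the real database. The SIGTERM handler only sets a flag; the task checks the flag at checkpoints, rolls back and reports "retry" if it is interrupted before the commit, and never reports "retry" once the commit has happened.

```python
import signal
import sqlite3


class Interrupted(Exception):
    """Raised at a checkpoint after a stop was requested."""


class ScheduleTask:
    """Hypothetical stand-in for the body of the "schedule_iso" task."""

    def __init__(self, conn):
        self.conn = conn
        self.stop_requested = False

    def handle_sigterm(self, signum, frame):
        # Only record the request; the task reacts at its next checkpoint.
        self.stop_requested = True

    def install_handler(self):
        # Must be called from the main thread of the worker process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def _checkpoint(self):
        if self.stop_requested:
            raise Interrupted()

    def run(self, job_names):
        try:
            for name in job_names:
                self._checkpoint()
                # sqlite3 opens a transaction implicitly before the first INSERT.
                self.conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
            self._checkpoint()
        except Interrupted:
            # Nothing is committed yet, so rolling back and retrying is safe.
            self.conn.rollback()
            return "retry"
        self.conn.commit()
        # Past this point a SIGTERM must not lead to a retry,
        # otherwise the same jobs would be scheduled twice.
        return "done"
```

A caller would create the task with an open connection, call `install_handler()` once, and re-enqueue the work only when `run()` returns "retry". How the real task signals a retry to Minion is outside the scope of this sketch.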

Updated by okurz about 3 years ago

Target version set to future

Updated by mkittler about 3 years ago

Related to action #110032: Migration group can not trigger, missing minion jobs? size:M added

Updated by tinita 8 months ago

Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added

Updated by okurz 2 months ago

Target version changed from future to Tools - Next

Updated by okurz 2 months ago

Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S
Status changed from New to Workable

Updated by tinita about 2 months ago

Target version changed from Tools - Next to Ready

Updated by ybonatakis about 2 months ago

Status changed from Workable to In Progress
Assignee set to ybonatakis

Updated by ybonatakis about 2 months ago

https://github.com/os-autoinst/openQA/pull/6381

Updated by ybonatakis about 2 months ago

Status changed from In Progress to Feedback

Set the status to Feedback as I was not able to test it manually, and when/if this is merged we will need some time to see the results in production 🤞🏾. That will eventually complete AC2.

#10

Updated by ybonatakis about 1 month ago

Status changed from Feedback to In Progress

looking at https://github.com/os-autoinst/openQA/pull/6381#pullrequestreview-2756165022

#11

Updated by openqa_review about 1 month ago

Due date set to 2025-04-26

Setting due date based on mean cycle time of SUSE QE Tools

#12

Updated by ybonatakis about 1 month ago

Status changed from In Progress to Feedback

ybonatakis wrote in #note-10:

looking at https://github.com/os-autoinst/openQA/pull/6381#pullrequestreview-2756165022

I passed the minion instance down to lib/OpenQA/Schema/Result/ScheduledProducts.pm in order to perform a retry if something interrupts it after the job transaction to the database. This affects lib/OpenQA/WebAPI/Controller/API/V1/Iso.pm, which also uses schedule_iso.

Waiting for reviews

#13

Updated by livdywan about 1 month ago

Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added

#14

Updated by ybonatakis about 1 month ago

Status changed from Feedback to In Progress

Back in progress to address some feedback, in particular https://github.com/os-autoinst/openQA/pull/6381#discussion_r2041570709

#15

Updated by livdywan about 1 month ago

Related to deleted (action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S)

#16

Updated by ybonatakis about 1 month ago

Status changed from In Progress to Feedback

#17

Updated by ybonatakis about 1 month ago

Merged. Keeping it in Feedback for now.

#18

Updated by ybonatakis about 1 month ago

I checked the Minion failures. I did find "Job terminated unexpectedly" in https://openqa.suse.de/minion/jobs?task=schedule_iso&state=failed&queue=&note= .

However, everything there is from after the recovery from https://progress.opensuse.org/issues/181175, and the jobs (5) show that they were created around 2025-04-22T14:28:30 UTC.

We updated osd later, at 3:11:48 PM GMT+2, and since then there have been no findings of "Job terminated unexpectedly". I will check once again tomorrow and close this if I don't see anything. Maybe I could also search the db.

#19

Updated by ybonatakis about 1 month ago

Status changed from Feedback to Resolved

I also checked today and didn't see any new occurrences. I am closing this one.

#20

Updated by okurz about 1 month ago

Due date deleted (~~2025-04-26~~)
