action #108983

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S

Added by mkittler about 3 years ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2022-03-25
Due date:
% Done: 0%
Estimated time:
Description

Acceptance criteria

  • AC1: minion job "schedule_iso" has a sigterm handler to decide how to shut down in a clean way in a reasonable time
  • AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" over multiple deployments for "schedule_iso"

Suggestions

  • Implement sigterm handler for "schedule_iso" like it has already been done for other jobs (e.g. #103416)
    • Considering we made this a Minion task, the whole operation might take a while, so I suppose having a sigterm handler makes sense. (It wouldn't if it was fast anyway.)
    • Since everything runs within a database transaction, I suppose we can simply abort the transaction and try again later. Of course, once the transaction has been committed we must not retry, to avoid scheduling jobs twice.
  • Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
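
The abort-and-retry idea from the suggestions above could look roughly like the following. This is only a minimal sketch in Python with an in-memory SQLite database, not the openQA implementation (which is Perl on top of Minion). StopFlag, schedule_iso, the jobs table and the on_progress hook are all hypothetical names chosen for illustration. The signal handler only sets a flag, and the task checks that flag at safe points before the commit. A SIGTERM that arrives after the commit is ignored, so jobs are never scheduled twice.

```python
import signal
import sqlite3


class StopFlag:
    """Records that SIGTERM was received; the task polls it at safe points."""

    def __init__(self):
        self.requested = False

    def request(self, signum=None, frame=None):
        # Only set a flag here; the task decides when it is safe to stop.
        self.requested = True

    def install(self):
        # Registers the handler; must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request)


def schedule_iso(conn, job_names, stop, on_progress=None):
    """Hypothetical task body: create all jobs within one transaction.

    Returns "finished" once committed, or "retry" if a stop was requested
    before the commit (in that case nothing from this run is persisted).
    """
    for name in job_names:
        if stop.requested:
            conn.rollback()  # abort: no job from this run survives
            return "retry"
        conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
        if on_progress:
            on_progress(name)  # hook, used here only to simulate a SIGTERM
    if stop.requested:
        conn.rollback()
        return "retry"
    conn.commit()
    # From here on a SIGTERM must not lead to a retry, otherwise the
    # jobs would be scheduled twice.
    return "finished"
```

In a real worker one would call stop.install() once at startup and hand "retry" back to the job queue so the task is run again later. How that maps onto Minion's retry mechanism is left open here.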

Related issues 2 (0 open, 2 closed)

  • Related to openQA Project (public) - action #110032: Migration group can not trigger, missing minion jobs? size:M (Resolved, kraih, 2022-05-03)
  • Related to openQA Project (public) - action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)

Actions #1

Updated by okurz about 3 years ago

  • Target version set to future
Actions #2

Updated by mkittler almost 3 years ago

  • Related to action #110032: Migration group can not trigger, missing minion jobs? size:M added
Actions #3

Updated by tinita 7 months ago

  • Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added
Actions #4

Updated by okurz about 1 month ago

  • Target version changed from future to Tools - Next
Actions #5

Updated by okurz about 1 month ago

  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S
  • Status changed from New to Workable
Actions #6

Updated by tinita 17 days ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by ybonatakis 16 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #9

Updated by ybonatakis 16 days ago

  • Status changed from In Progress to Feedback

Set the status to Feedback as I was not able to test it manually. When/if this is merged, we will need some time to see the results in production 🤞🏾, which will eventually complete AC2.

Actions #10

Updated by ybonatakis 14 days ago

  • Status changed from Feedback to In Progress
Actions #11

Updated by openqa_review 14 days ago

  • Due date set to 2025-04-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by ybonatakis 11 days ago

  • Status changed from In Progress to Feedback

ybonatakis wrote in #note-10:

Looking at https://github.com/os-autoinst/openQA/pull/6381#pullrequestreview-2756165022

I passed the minion instance down to lib/OpenQA/Schema/Result/ScheduledProducts.pm in order to perform a retry if something interrupts the task after the job transaction to the db. This affects lib/OpenQA/WebAPI/Controller/API/V1/Iso.pm, which also uses schedule_iso.

Waiting for reviews

Actions #13

Updated by livdywan 11 days ago

  • Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added
Actions #14

Updated by ybonatakis 10 days ago

  • Status changed from Feedback to In Progress

In progress to address some feedback, in particular https://github.com/os-autoinst/openQA/pull/6381#discussion_r2041570709

Actions #15

Updated by livdywan 10 days ago

  • Related to deleted (action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S)
Actions #16

Updated by ybonatakis 10 days ago

  • Status changed from In Progress to Feedback
Actions #17

Updated by ybonatakis 8 days ago

Merged. Keeping it in Feedback for now.

Actions #18

Updated by ybonatakis 2 days ago

I checked the Minion failures. I did find "Job terminated unexpectedly" in https://openqa.suse.de/minion/jobs?task=schedule_iso&state=failed&queue=&note= .

However, everything there is from after the recovery from https://progress.opensuse.org/issues/181175, and the 5 jobs show that they were created around 2025-04-22T14:28:30 UTC.

We updated OSD later at 3:11:48 PM GMT+2 and since then there have been no findings of "Job terminated unexpectedly". I will check once again tomorrow and close it if I don't see anything. Maybe I could also search the db.

Actions #19

Updated by ybonatakis 1 day ago

  • Status changed from Feedback to Resolved

I also checked today and I didn't see any new occurrences. I am closing this one.

Actions #20

Updated by okurz 1 day ago

  • Due date deleted (2025-04-26)