action #108983
closed
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S
Added by mkittler about 3 years ago.
Updated 1 day ago.
Category:
Feature requests
Description
Acceptance criteria
- AC1: The Minion job "schedule_iso" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "schedule_iso" over multiple deployments
Suggestions
- Implement sigterm handler for "schedule_iso" like it has already been done for other jobs (e.g. #103416)
- Considering we made this a Minion task, the whole operation might take a while, so I suppose having a SIGTERM handler makes sense. (It wouldn't if it was fast anyway.)
- Since everything runs within a database transaction, I suppose we can simply abort the transaction and try again later. Of course, once the transaction has been committed we must not retry, to avoid scheduling jobs twice.
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
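The abort-and-retry idea from the suggestions above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database, not openQA's Perl/Minion code. The `jobs` table, the `schedule_iso` function, its `"retry"`/`"finished"` return values and the `install_sigterm_flag` helper are all hypothetical names chosen for the example.

```python
import signal
import sqlite3


def install_sigterm_flag():
    """Install a SIGTERM handler that only records the request (hypothetical helper)."""
    state = {"stop": False}

    def on_sigterm(signum, frame):
        # Do no real work in the handler; the task polls the flag instead.
        state["stop"] = True

    signal.signal(signal.SIGTERM, on_sigterm)
    return lambda: state["stop"]


def schedule_iso(conn, job_names, stop_requested):
    """Insert all jobs in one transaction and abort cleanly if asked to stop.

    Returns "retry" if the transaction was rolled back before the commit,
    or "finished" once the commit has happened (after which no retry occurs).
    """
    for name in job_names:
        if stop_requested():
            # Nothing has been committed yet, so rolling back is safe
            # and the whole task can simply be enqueued again later.
            conn.rollback()
            return "retry"
        conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    conn.commit()
    # Past this point the jobs exist; retrying would schedule them twice.
    return "finished"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (name TEXT)")
    conn.commit()
    stop_requested = install_sigterm_flag()
    print(schedule_iso(conn, ["job-a", "job-b"], stop_requested))
```

The flag is checked only before the commit, so a termination request either rolls back everything or arrives too late to matter; there is no state in which a retry would duplicate already-scheduled jobs.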
- Target version set to future
- Related to action #110032: Migration group can not trigger, missing minion jobs? size:M added
- Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added
- Target version changed from future to Tools - Next
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule size:S
- Status changed from New to Workable
- Target version changed from Tools - Next to Ready
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
- Status changed from In Progress to Feedback
Set the status to Feedback because I was not able to test this manually; if and when this is merged, we will need some time to see the results in production 🤞🏾. That will eventually complete AC2.
- Status changed from Feedback to In Progress
- Due date set to 2025-04-26
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
- Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added
- Status changed from Feedback to In Progress
- Related to deleted (action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S)
- Status changed from In Progress to Feedback
Merged. Keeping it in Feedback for now.
- Status changed from Feedback to Resolved
I checked again today and didn't see any new occurrences. I am closing this one.
- Due date deleted (2025-04-26)