action #103416
closed
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
Added by okurz almost 3 years ago.
Updated almost 3 years ago.
Category:
Feature requests
Description
Acceptance criteria
- AC1: The Minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "limit_results_and_logs" over multiple deployments
Suggestions
- Implement sigterm handler for "limit_results_and_logs"
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
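The first suggestion can be illustrated with a minimal sketch of the general pattern: the signal handler only sets a flag, and the long-running task checks that flag between units of work so it can return cleanly instead of being killed mid-operation. This is not the code from the merged PR. openQA's Minion tasks are written in Perl, and everything named here (`GracefulStop`, `run_cleanup`, `cleanup_group`, the list of groups) is a hypothetical stand-in chosen for illustration.

```python
import signal


class GracefulStop:
    """Remembers that SIGTERM arrived so a work loop can stop between items."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread; replaces the default SIGTERM action.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler trivial: record the request and let the loop react to it.
        self.requested = True


def run_cleanup(groups, cleanup_group, stop):
    """Hypothetical stand-in for a task like "limit_results_and_logs".

    Processes one group at a time and checks the stop flag before starting
    the next one. Returns the groups that were fully processed and whether
    the run was interrupted.
    """
    done = []
    for group in groups:
        if stop.requested:
            break
        cleanup_group(group)
        done.append(group)
    return done, stop.requested
```

A caller would create a `GracefulStop`, call `install()` once at startup, and pass it to `run_cleanup`. When the second return value is true, the task could finish with a regular result (or be retried later) rather than ending up as "Job terminated unexpectedly". How large a unit of work is safe to finish after SIGTERM depends on how long the service manager waits before sending SIGKILL.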
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Due date set to 2021-12-17
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
The PR has been merged. Let's see whether it helps. Note that this ticket is only about specific jobs so it is expected that not all unexpectedly terminating jobs are gone.
- Status changed from Feedback to Resolved
The other PR has been merged as well and the Minion dashboards still look good.
- Copied to action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added
- Due date deleted (2021-12-17)
- Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added