action #103416
coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
Acceptance criteria
- AC1: The minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
- AC2: Our minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "limit_results_and_logs" over multiple deployments
- Implement sigterm handler for "limit_results_and_logs"
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g.
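The handler asked for in AC1 can be illustrated with a minimal sketch. openQA's Minion tasks are written in Perl, so this Python version only shows the general pattern and is not the code from the merged PR. The names `StopRequested`, `cleanup_items` and `run_job`, as well as the `"retry"` result and the wording of its note, are assumptions made for illustration.

```python
import signal


class StopRequested:
    """Flag that the SIGTERM handler sets and the job loop polls."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Only record the request; the job decides when it is safe to stop.
        self.requested = True


def cleanup_items(items, cleanup_one, stop):
    """Clean up items one by one, stopping between items after SIGTERM.

    Returns (state, note, processed). The state "retry" is a hypothetical
    marker meaning the job should be enqueued again rather than failed.
    """
    processed = []
    for item in items:
        if stop.requested:
            return ("retry", "Received SIGTERM, scheduling retry", processed)
        cleanup_one(item)
        processed.append(item)
    return ("finished", None, processed)


def run_job(items, cleanup_one):
    """Install the SIGTERM handler for the duration of one job run."""
    stop = StopRequested()
    previous = signal.signal(signal.SIGTERM, stop.handle)
    try:
        return cleanup_items(items, cleanup_one, stop)
    finally:
        # Restore whatever handler was active before the job started.
        signal.signal(signal.SIGTERM, previous)
```

Polling a flag between work items, rather than exiting inside the handler, lets the item in progress finish, so a service restart leaves no half-deleted results behind. How long shutdown takes is then bounded by the time needed for a single item.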
Updated by livdywan about 3 years ago
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
- Status changed from New to Workable
Updated by mkittler about 3 years ago
- Status changed from Workable to In Progress
Updated by openqa_review about 3 years ago
- Due date set to 2021-12-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 3 years ago
- Status changed from In Progress to Feedback
The PR has been merged. Let's see whether it helps. Note that this ticket covers only specific jobs, so some unexpectedly terminating jobs are expected to remain.
Updated by mkittler about 3 years ago
Looks good so far. I've also checked o3, where it looks good as well. (I then cleaned up the Minion dashboard on o3, as there were over 400 failed jobs.)
Updated by mkittler about 3 years ago
It looks like it works; there is already a retried Minion job (with the corresponding note) on OSD:
Applying the same approach for other cleanup jobs is easy so I've just created a PR (not strictly part of this ticket):
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
The other PR has been merged as well and the Minion dashboards still look good.
Updated by okurz about 3 years ago
- Copied to action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added
Updated by tinita 5 months ago
- Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added