action #108989
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle
Start date: 2022-03-25
Due date:
% Done: 0%
Estimated time:
Description
Acceptance criteria
- AC1: The minion jobs "delete_needles"/"save_needle" have a SIGTERM handler so they can shut down cleanly within a reasonable time
- AC2: The minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" for "delete_needles"/"save_needle" over multiple deployments
Suggestions
- Implement a SIGTERM handler for "delete_needles"/"save_needle" (see the sketch after this list).
- Note that this may be less important than for other jobs because these Git operations shouldn't take long.
- If we go for it nevertheless, aborting mid-operation would mean leaving the Git repository in a dirty state. Maybe we should simply allow that by doing an upfront cleanup of the checkout in these tasks (which would also help with #70774).
- Likely we can just abort without a retry here because the user sees an error message in the web UI anyway and can simply try again. (Retrying might actually be harmful if the user also retries manually.)
- Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, then check the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
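
A minimal sketch (plain Mojolicious/Minion, not the actual openQA task code) of what such a handler together with the upfront checkout cleanup could look like; the task name comes from this ticket, but the checkout path, database connection string and error message are placeholders:

```perl
use Mojolicious::Lite -signatures;

plugin Minion => {Pg => 'postgresql://user@/openqa'};    # placeholder connection string

app->minion->add_task(save_needle => sub ($job, @args) {
    my $checkout = '/var/lib/openqa/share/tests/needles';    # placeholder path

    # Upfront cleanup: start from a pristine checkout so a previously aborted
    # run cannot leave us with a dirty working tree
    system 'git', '-C', $checkout, 'reset', '--hard';
    system 'git', '-C', $checkout, 'clean', '-dxf';

    # SIGTERM handler: abort without retry; the user sees the error in the
    # web UI and can simply trigger the save again
    local $SIG{TERM} = sub {
        $job->fail('save_needle aborted by SIGTERM (worker shutting down)');
        exit 0;
    };

    # ... the actual needle saving and Git commit/push would happen here ...
});

app->start;
```

With something like this, a restart of openqa-gru during a needle save would show up as a regular failed minion job with a clear message instead of "Job terminated unexpectedly", and the upfront cleanup ensures the next run starts from a clean checkout.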