action #108989
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle::Save size:S
Description
Acceptance criteria
- AC1: minion jobs "save_needle" have a sigterm handler to decide how to shut down in a clean way in a reasonable time
- AC2: Our minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" over multiple deployments for "save_needle"
- AC3: If do_cleanup is set to no (the default) and the git repo is in a dirty state, then the minion job is aborted with a useful error message
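To illustrate AC3, here is a minimal sketch of the dirty-checkout check. It is written in Python purely for illustration (openQA itself is Perl), and the names `DirtyCheckoutError`, `dirty_paths`, `git_status` and `ensure_clean` are hypothetical and do not come from the openQA code base. It assumes `do_cleanup` can be treated as a boolean and that `git status --porcelain` is used to detect uncommitted changes.

```python
import subprocess


class DirtyCheckoutError(RuntimeError):
    """Hypothetical error: the needles checkout has uncommitted changes."""


def dirty_paths(porcelain_output):
    # Each line of `git status --porcelain` is "XY <path>": two status
    # characters, one space, then the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def git_status(repo_dir):
    # Ask Git for a machine-readable summary of uncommitted changes.
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def ensure_clean(porcelain_output, do_cleanup=False):
    # Abort with a descriptive message when the checkout is dirty and
    # no upfront cleanup is configured (assumed to be the default).
    paths = dirty_paths(porcelain_output)
    if paths and not do_cleanup:
        raise DirtyCheckoutError(
            "Needles checkout has uncommitted changes ("
            + ", ".join(paths)
            + "); aborting because do_cleanup is set to no"
        )
```

A caller would run `ensure_clean(git_status(repo_dir), do_cleanup)` before touching the checkout. Keeping the parsing separate from the Git call makes the check testable without a real repository.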
Suggestions
- Implement a SIGTERM handler for "save_needle", similar to what we did for example in https://github.com/os-autoinst/openQA/pull/4415/files
* Note that this is maybe not as important as for other jobs because these Git operations shouldn't take too long.
* If we go for it nevertheless, that would mean leaving the Git repository in a dirty state. Maybe we should simply allow that by doing an upfront cleanup of the checkout in these tasks (which would also help with #70774).
* Alternatively, do a "git restore ." and other necessary cleanups, then end the task to leave the clone in a clean state.
* Likely we can just abort without a retry here because the user will see an error message in the web UI anyways and can just try it again. (Retrying might actually be bad if the user also retries manually.)
- Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
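The shutdown pattern suggested above can be sketched as follows. This is a Python illustration of the general idea only, not openQA's Perl implementation and not a reproduction of the linked pull request. `TaskAborted` and `run_with_sigterm_guard` are hypothetical names. It assumes the task can be split into short steps, that the handler only records the request, and that the task aborts without a retry after cleaning up.

```python
import signal


class TaskAborted(Exception):
    """Hypothetical marker: the task stopped early and should not be retried."""


def run_with_sigterm_guard(steps, cleanup):
    """Run callables in order; stop cleanly if SIGTERM arrives in between."""
    stop_requested = False

    def on_sigterm(signum, frame):
        nonlocal stop_requested
        stop_requested = True  # only record the request; act between steps

    # Must be called from the main thread, as required by signal.signal().
    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        for step in steps:
            if stop_requested:
                # Leave the checkout clean (e.g. the "git restore ."
                # variant) and give up instead of retrying.
                cleanup()
                raise TaskAborted("received SIGTERM, aborting without retry")
            step()
    finally:
        # Restore whatever handler was installed before the task ran.
        signal.signal(signal.SIGTERM, previous)
```

Checking the flag only between steps means a running Git operation is never interrupted halfway. This matches the note above that these operations should not take long.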
Updated by mkittler about 3 years ago
- Tracker changed from coordination to action
- Category set to Feature requests
Updated by okurz 2 months ago
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 month ago
- Target version changed from Tools - Next to Ready
Updated by tinita 5 days ago
- Copied to action #183077: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle::Delete size:S added
Updated by tinita 5 days ago
- Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle size:S to Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle::Save size:S
- Description updated (diff)
Split out delete_needles to #183077
We found that this task is a bit bigger than we thought.
Needle::Save is already doing a cleanup (if configured) at the beginning, but Needle::Delete does not.
Updated by openqa_review 4 days ago
- Due date set to 2025-06-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 4 days ago
- Status changed from In Progress to Feedback
PR https://github.com/os-autoinst/openQA/pull/6483
I've added a signal handler to save_needle to better handle failing tasks:
- the job will abort if do_cleanup is configured to 'no'
- logs are now updated with the reason and the error output
My change didn't trigger any failing tests, but for consistency I added some outline tests.
Updated by tinita 2 days ago
- Related to action #179314: Improve git conflict handling in save_needle size:M added