action #108989
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle
Start date: 2022-03-25
Due date:
% Done: 0%
Estimated time:
Description
Acceptance criteria
- AC1: The minion jobs "delete_needles"/"save_needle" have a SIGTERM handler so they can shut down cleanly within a reasonable time
- AC2: The minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" for "delete_needles"/"save_needle" over multiple deployments
Suggestions
- Implement a SIGTERM handler for "delete_needles"/"save_needle" (see the sketch after this list).
- Note that this may be less important than for other jobs because these Git operations shouldn't take long.
- If we go for it nevertheless, aborting mid-operation would mean leaving the Git repository in a dirty state. Maybe we should simply allow that by doing an upfront cleanup of the checkout in these tasks (which would also help with #70774).
- Likely we can just abort without a retry here because the user sees an error message in the web UI anyway and can simply try again. (Retrying might actually be harmful if the user also retries manually.)
- Test on o3 and osd either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, then check the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
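
A minimal sketch (plain Mojolicious/Minion, not the actual openQA task code) of what such a handler together with the upfront checkout cleanup could look like; the task name comes from this ticket, but the checkout path, database connection string and error message are placeholders:

```perl
use Mojolicious::Lite -signatures;

plugin Minion => {Pg => 'postgresql://user@/openqa'};    # placeholder connection string

app->minion->add_task(save_needle => sub ($job, @args) {
    my $checkout = '/var/lib/openqa/share/tests/needles';    # placeholder path

    # Upfront cleanup: start from a pristine checkout so a previously aborted
    # run cannot leave us with a dirty working tree
    system 'git', '-C', $checkout, 'reset', '--hard';
    system 'git', '-C', $checkout, 'clean', '-dxf';

    # SIGTERM handler: abort without retry; the user sees the error in the
    # web UI and can simply trigger the save again
    local $SIG{TERM} = sub {
        $job->fail('save_needle aborted by SIGTERM (worker shutting down)');
        exit 0;
    };

    # ... the actual needle saving and Git commit/push would happen here ...
});

app->start;
```

With something like this, a restart of openqa-gru during a needle save would show up as a regular failed minion job with a clear message instead of "Job terminated unexpectedly", and the upfront cleanup ensures the next run starts from a clean checkout.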