
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

[epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Added by okurz 8 months ago. Updated about 2 months ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-12-02
Due date:
% Done:

50%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Motivation

We often have alerts about "Too many Minion job failures", some of them with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. This problem is independent of the task but is of course seen much more often for tasks we spawn many jobs of (e.g. finalize_job_results tasks) and tasks with possibly long-running jobs (e.g. limit_assets). The error presumably just means the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we should review the different kinds of jobs we have and decide whether to ignore those failures, turn them into "passed" (which may need an upstream feature) or handle them differently, e.g. retrigger them automatically.

Acceptance criteria

  • AC1: All our Minion jobs have a SIGTERM handler to decide how to shut down cleanly within a reasonable time
  • AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures over multiple deployments

Suggestions

  • Since such tasks are either not very important or triggered periodically we should review the different kind of jobs we have and decide if we should ignore those failures, turn them into "passed" (maybe need upstream feature) or handle somehow differently, e.g. retrigger automatically.
  • Implement a SIGTERM handler for each Minion job
  • Test on o3 and OSD, either by manually restarting openqa-gru multiple times or by awaiting the results of multiple deployments, and check the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed

Subtasks

action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M (Resolved, assignee: mkittler)

action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved, assignee: okurz)

action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M (Resolved, assignee: mkittler)

action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download (New)

action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule (New)

action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle (New)

History

#1 Updated by okurz 6 months ago

Discussed on 2021-11-10 as we again found some Minion jobs failing with the above symptoms. Whenever processes are stopped or restarted it is likely that we hit a finalize_job_results job because we have many openQA jobs. These Minion jobs likely only take a few seconds, so we should ensure that we have explicit TERM signal handling to either just finish the task or stop gracefully within a reasonable time, i.e. a few seconds. Because the main Minion job handler tries to bring down Minion jobs gracefully, we should be OK to just ignore the TERM signal.

#2 Updated by mkittler 6 months ago

Note that systemd's default "KillMode" is control-group, so SIGTERM is sent to all processes (see https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=). I suppose for a graceful termination of Minion jobs we would need to set it to mixed and possibly increase the time until SIGKILL is sent. Then the remaining timed-out jobs should be distinguishable as they have received SIGKILL (and not SIGTERM). Possibly the Minion framework could still be changed to make such cancelled jobs better distinguishable (and to restart them automatically depending on some setting of the job).
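As a hedged illustration of the KillMode idea above, a systemd drop-in for openqa-gru might look like the following. The drop-in path and the 120-second timeout are assumptions for the sake of the example, not the actual OSD configuration:

```ini
# /etc/systemd/system/openqa-gru.service.d/killmode.conf (hypothetical drop-in)
[Service]
# mixed: SIGTERM goes only to the main process, so Minion job processes
# can finish on their own; the rest of the control group only receives
# SIGKILL on the final kill.
KillMode=mixed
# Give long-running jobs more time before systemd escalates to SIGKILL.
TimeoutStopSec=120
```

After adding such a drop-in, `systemctl daemon-reload` would be needed for it to take effect.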

#3 Updated by okurz 6 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Target version changed from future to Ready

As we again have multiple minion job alerts just after deploying OSD I am adding this to the backlog

#4 Updated by cdywan 6 months ago

Indeed this is from this morning:

---
args: []
attempts: 1
children: []
created: 2021-11-30T23:00:05.58353Z
delayed: 2021-11-30T23:00:05.58353Z
expires: 2021-12-02T23:00:05.58353Z
finished: 2021-12-01T05:23:05.67069Z
id: 3538496
lax: 0
notes:
  gru_id: 30652445
parents: []
priority: 5
queue: default
result: 'Job terminated unexpectedly (exit code: 0, signal: 15)'
retried: ~
retries: 0
started: 2021-11-30T23:00:05.58784Z
state: failed
task: limit_results_and_logs
time: 2021-12-01T10:16:52.2364Z
worker: 575
> sudo grep --color=always limit_results_and_logs /var/log/apache2/error_log /var/log/apache2/error_log /var/log/openqa{,_gru}
/var/log/openqa_gru:[2021-12-01T00:00:05.597881+01:00] [debug] Process 31351 is performing job "3538496" with task "limit_results_and_logs"

#5 Updated by okurz 6 months ago

  • Tracker changed from action to coordination
  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" to [epic] Better handle minion tasks failing with "Job terminated unexpectedly"
  • Description updated (diff)
  • Category set to Feature requests

#6 Updated by okurz 6 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

#7 Updated by okurz 5 months ago

  • Status changed from Blocked to Feedback

With the one subtask resolved, in which mkittler also covered other jobs, we can now check the situation after the next OSD deployments to see if there are Minion jobs left with the symptom "Job terminated unexpectedly". If all are solved we can move to a daily deployment or an even higher frequency :)

#8 Updated by mkittler 5 months ago

I was looking at the archiving code anyway (for #104136), so here is a PR for the archiving task: https://github.com/os-autoinst/openQA/pull/4415

#9 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

After another deployment https://openqa.suse.de/minion/jobs?state=failed actually shows no entries. This is great!

#10 Updated by mkittler 3 months ago

  • Status changed from Resolved to Feedback

Today I looked into Minion jobs and found further instances, e.g. one archiving job and many finalize jobs. I don't think this is solved until all jobs use the signal guard (also see AC1).

#11 Updated by okurz 3 months ago

  • Status changed from Feedback to Blocked

OK, fine. Created a specific ticket #107533 for "finalize_job_results"

#12 Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

#13 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#14 Updated by mkittler about 2 months ago

Remaining tasks which are still missing the signal handler:

  1. OpenQA::Task::Asset::Download
  2. OpenQA::Task::Iso::Schedule
  3. OpenQA::Task::Needle::Delete
  4. OpenQA::Task::Needle::Save

Maybe 3. and 4. aren't that important because concluding the Git operation shouldn't take long (and therefore shouldn't exceed the timeout anyway). At least continuing those shouldn't be worse than trying to clean up what has already been changed so far. We could of course skip the cleanup and hard-reset the Git repository as the first step in these tasks (so it wouldn't matter if there are half-done changes). That would also help with https://progress.opensuse.org/issues/70774.
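The hard-reset-first idea above could be sketched as follows. This is a generic Git sequence, not openQA code: a throwaway repository stands in for the needles checkout, and the helper `git()` is an assumption for the example.

```python
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command in the given repository and return its stdout."""
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout


# Throwaway repository standing in for the needles checkout.
repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", repo], check=True)
git(repo, "-c", "user.email=ci@example.com", "-c", "user.name=ci",
    "commit", "-q", "--allow-empty", "-m", "initial state")

# Simulate half-done changes left behind by an interrupted job.
with open(f"{repo}/leftover.json", "w") as fh:
    fh.write("half-done change\n")

# Hard-reset as the first step of the task: discard modifications to
# tracked files and remove untracked leftovers, so it no longer matters
# whether a previous job was interrupted mid-way.
git(repo, "reset", "--hard", "-q")
git(repo, "clean", "-fdq")
assert git(repo, "status", "--porcelain") == ""  # tree is clean again
```

With this ordering, the cleanup on SIGTERM can be skipped entirely, since the next run repairs the working tree before doing anything else.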

#15 Updated by mkittler about 2 months ago

  • Status changed from Workable to Blocked

Blocked by newly created subtasks.

#16 Updated by mkittler about 2 months ago

There's still a small number of finalize_job_results jobs (and even one archive_job_results job) failing with "Job terminated unexpectedly". So I suppose the small window before setting up the signal handler is in fact problematic. However, it likely makes more sense to work on the other subtasks of the parent epic as those problems produce way more failed jobs now.
