action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M - openQA Project (public) - openSUSE Project Management Tool

action #103416

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2021-12-02

Description

Acceptance criteria

AC1: The Minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
AC2: The Minion job lists on OSD and o3 do not show any "Job terminated unexpectedly" failures for "limit_results_and_logs" over multiple deployments

Suggestions

Implement a SIGTERM handler for "limit_results_and_logs"
Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
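
The handler idea from the first suggestion can be sketched roughly as follows. This is only an illustration of the general pattern in Python, not the code from the openQA PR (openQA itself is written in Perl). The class name `CleanupTask`, the `handle_item` callback and the "retry"/"finished" return values are hypothetical and chosen for this sketch. The assumption is that the cleanup works through a list of items, checks a stop flag between items, and reports that it should be retried instead of being killed mid-way.

```python
import signal


class CleanupTask:
    """Hypothetical long-running cleanup that stops cleanly on SIGTERM."""

    def __init__(self, items, handle_item):
        self.remaining = list(items)    # work still to be done
        self.handle_item = handle_item  # callback doing the actual cleanup
        self.stop_requested = False

    def request_stop(self, signum, frame):
        # Signal handler: only set a flag, the loop reacts between items.
        self.stop_requested = True

    def run(self):
        # Install the handler for the duration of the task only.
        previous = signal.signal(signal.SIGTERM, self.request_stop)
        try:
            while self.remaining:
                if self.stop_requested:
                    # Stop between items so that no item is left half-processed.
                    return "retry"
                self.handle_item(self.remaining.pop(0))
            return "finished"
        finally:
            # Restore whatever handler was active before (None means it
            # was not installed from Python and cannot be restored here).
            if previous is not None:
                signal.signal(signal.SIGTERM, previous)
```

With this shape, a service restart during the cleanup leads to a "retry" result with the unprocessed items still in `remaining`, instead of an abruptly terminated job.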

Related issues 2 (0 open — 2 closed)

Updated by livdywan over 3 years ago

Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
Status changed from New to Workable

Updated by mkittler over 3 years ago

Assignee set to mkittler

Updated by mkittler over 3 years ago

Status changed from Workable to In Progress

PR: https://github.com/os-autoinst/openQA/pull/4383

Updated by openqa_review over 3 years ago

Due date set to 2021-12-17

Setting due date based on mean cycle time of SUSE QE Tools

Updated by mkittler over 3 years ago

Status changed from In Progress to Feedback

The PR has been merged. Let's see whether it helps. Note that this ticket only covers specific jobs, so not all unexpectedly terminating jobs are expected to be gone.

Updated by mkittler over 3 years ago

https://openqa.suse.de/minion/jobs?task=limit_results_and_logs looks good so far. I've also checked on o3, where it looks good as well. (I then cleaned up the Minion dashboard on o3, as there were over 400 failed jobs.)

Updated by mkittler over 3 years ago

Looks like it works: there's already a retried Minion job (with the corresponding note) on OSD: https://openqa.suse.de/minion/jobs?id=3588093

Applying the same approach to other cleanup jobs is easy, so I've created a PR for that as well (not strictly part of this ticket): https://github.com/os-autoinst/openQA/pull/4396
