action #103416

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-12-02
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: The Minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: The Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "limit_results_and_logs" across multiple deployments

Suggestions

  • Implement a SIGTERM handler for "limit_results_and_logs"
  • Test on o3 and OSD, either by manually restarting openqa-gru multiple times or by waiting for multiple deployments, then checking the Minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
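
To illustrate the first suggestion, here is a minimal sketch of a cooperative SIGTERM handler. It is written in Python purely for illustration and is not the openQA implementation. The names `TerminationGuard` and `run_cleanup`, the `delete` callback, the returned dictionary and the note text are all hypothetical. The idea is that the handler only records the request, and the cleanup loop checks the flag between work items so it can stop at a safe point and ask to be retried instead of being killed mid-operation.

```python
import os
import signal


class TerminationGuard:
    """Hypothetical helper: remembers that SIGTERM arrived instead of dying."""

    def __init__(self):
        self.requested = False
        self._previous = None

    def __enter__(self):
        # Install our handler and remember the previous one for restoring.
        self._previous = signal.signal(signal.SIGTERM, self._on_sigterm)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the earlier handler (fall back to the default if unknown).
        previous = self._previous if self._previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, previous)
        return False

    def _on_sigterm(self, signum, frame):
        # Only set a flag; the actual shutdown decision is made in the loop.
        self.requested = True


def run_cleanup(items, delete):
    """Hypothetical cleanup task: stop between items once SIGTERM was seen."""
    processed = []
    with TerminationGuard() as guard:
        for item in items:
            if guard.requested:
                # Assumed outcome: report "retry" with a note, not a failure.
                return {"state": "retry",
                        "note": "Received SIGTERM, retrying later",
                        "processed": processed}
            delete(item)
            processed.append(item)
    return {"state": "finished", "note": None, "processed": processed}
```

Each work item should be short enough that the loop reaches the next check well before the service manager escalates from SIGTERM to SIGKILL.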

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M (Resolved, mkittler, 2024-10-04)
Copied to openQA Project - action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved, okurz)

Actions #1

Updated by livdywan almost 3 years ago

  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
  • Status changed from New to Workable
Actions #2

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler almost 3 years ago

  • Status changed from Workable to In Progress
Actions #4

Updated by openqa_review almost 3 years ago

  • Due date set to 2021-12-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see whether it helps. Note that this ticket only covers specific jobs, so other Minion jobs may still terminate unexpectedly.

Actions #6

Updated by mkittler almost 3 years ago

https://openqa.suse.de/minion/jobs?task=limit_results_and_logs looks good so far. I've also checked o3, where it looks good as well. (I then cleaned up the Minion dashboard on o3, as there were over 400 failed jobs.)

Actions #7

Updated by mkittler almost 3 years ago

Looks like it works: there's already a retried Minion job (with the corresponding note) on OSD: https://openqa.suse.de/minion/jobs?id=3588093

Applying the same approach to other cleanup jobs is easy, so I've created a PR (not strictly part of this ticket): https://github.com/os-autoinst/openQA/pull/4396

Actions #8

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved

The other PR has been merged as well and the Minion dashboards still look good.

Actions #9

Updated by okurz almost 3 years ago

  • Copied to action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added
Actions #10

Updated by okurz almost 3 years ago

  • Due date deleted (2021-12-17)
Actions #11

Updated by tinita about 2 months ago

  • Related to action #167797: scripts-ci multimachine test CI job fails due to job incompleting with "minion failed" size:M added