action #107533

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M

Added by okurz almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Acceptance criteria

  • AC1: The Minion job "finalize_job_results" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: Our Minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "finalize_job_results" over multiple deployments (or service restarts)

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved, okurz)

Actions #1

Updated by okurz almost 3 years ago

  • Copied from action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added
Actions #2

Updated by mkittler almost 3 years ago

Maybe the one archiving job I found today was killed because it took too long to handle SIGTERM: it was already at the point where everything had been copied. At this point we continue regardless of the signal, because the job only needs to update one database row and delete the old results directory, neither of which should take very long. I implemented it this way to avoid leftover result directories for archived jobs, so I wouldn't change it to abort here. Maybe we can extend the timeout before the jobs are killed.
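
The behaviour described in this comment (abort on SIGTERM while still copying, but finish once everything has been copied) can be illustrated with a minimal sketch. This is not the openQA implementation, which is written in Perl; it is a Python illustration under stated assumptions. All names here (`StopFlag`, `Interrupted`, `archive_results`, `run_with_sigterm`, the `mark_archived` callback and the `after_each_copy` hook) are hypothetical, and the results directory is assumed to contain only plain files.

```python
import os
import shutil
import signal


class StopFlag:
    """Records that a shutdown was requested; the task decides when to act on it."""

    def __init__(self):
        self.requested = False

    def handle(self, signum=None, frame=None):
        # Has the signature of a signal handler; it only sets a flag.
        self.requested = True


class Interrupted(Exception):
    """Raised when a stop request arrives before the point of no return."""


def archive_results(src, dst, mark_archived, stop, after_each_copy=None):
    """Copy src to dst, then commit. Abort only while still copying."""
    os.makedirs(dst, exist_ok=True)
    for name in sorted(os.listdir(src)):
        if stop.requested:
            # Still safe to abort: discard the partial copy, keep the original.
            shutil.rmtree(dst, ignore_errors=True)
            raise Interrupted("stop requested while copying")
        shutil.copy2(os.path.join(src, name), os.path.join(dst, name))
        if after_each_copy is not None:
            after_each_copy(name)  # hypothetical hook, only for demonstration
    # Point of no return: the stop flag is deliberately ignored from here on,
    # because the remaining steps are short and aborting would leave a
    # leftover results directory behind.
    mark_archived(dst)  # stands in for the single database row update
    shutil.rmtree(src)  # delete the old results directory
    return dst


def run_with_sigterm(task):
    """Install a SIGTERM handler around task(stop), then restore the old one."""
    stop = StopFlag()
    previous = signal.signal(signal.SIGTERM, stop.handle)
    try:
        return task(stop)
    finally:
        signal.signal(signal.SIGTERM, previous)
```

A caller would wrap the task, for example `run_with_sigterm(lambda stop: archive_results(src, dst, update_row, stop))`, where `update_row` is again a placeholder. The trade-off is the one raised above: the final steps ignore the signal, so the supervisor's kill timeout has to be long enough for them to complete.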

Actions #3

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler almost 3 years ago

  • Status changed from New to In Progress
Actions #5

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-03-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by mkittler over 2 years ago

The change was deployed today (11:12, CET). It looks good so far: the most recent jobs on https://openqa.suse.de/minion/jobs?state=failed&task=finalize_job_results are from 4 days ago.

Actions #8

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved

It still looks good, with no further occurrences, so I'm resolving the issue.

Actions #9

Updated by okurz over 2 years ago

  • Due date deleted (2022-03-26)