action #103416

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly"

Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M

Added by okurz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-12-02
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Acceptance criteria

  • AC1: The minion job "limit_results_and_logs" has a SIGTERM handler that decides how to shut down cleanly within a reasonable time
  • AC2: Our minion job lists on OSD and O3 do not show any "Job terminated unexpectedly" failures for "limit_results_and_logs" over multiple deployments

Suggestions

  • Implement a SIGTERM handler for "limit_results_and_logs"
  • Test on o3 and osd either by manually restarting openqa-gru multiple times or awaiting the result from multiple deployments and checking the minion dashboard, e.g. openqa.suse.de/minion/jobs?state=failed
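
A minimal sketch of the handler idea from the first suggestion, written in Python for illustration only. openQA and Minion are Perl, and this is not the code from the merged PR. The names `CleanupJob`, `request_stop`, `run` and `install_sigterm_handler` are hypothetical. The assumption is that the handler only records the stop request, and the job then stops at the next safe checkpoint and reports that it should be retried rather than being killed mid-item.

```python
import signal


class CleanupJob:
    """Hypothetical stand-in for a long-running cleanup task."""

    def __init__(self, process_item):
        self.process_item = process_item
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only record the request so the item currently
        # being processed can finish; the loop reacts at its next checkpoint.
        self.stop_requested = True

    def run(self, items):
        """Process items until done or until a stop was requested.

        Returns ("finished", n) or ("retry", n), where n is the number
        of items processed before returning.
        """
        done = 0
        for item in items:
            if self.stop_requested:
                # Checkpoint between items: stop cleanly and ask for a retry.
                return ("retry", done)
            self.process_item(item)
            done += 1
        return ("finished", done)


def install_sigterm_handler(job):
    # Must be called from the main thread; returns the previous handler
    # so the caller can restore it afterwards.
    return signal.signal(signal.SIGTERM, job.request_stop)
```

In this sketch a service restart sends SIGTERM, the handler sets the flag, and `run` returns `("retry", n)` at the next item boundary, so the caller can re-enqueue the job instead of it being recorded as a failure.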

Related issues

Copied to openQA Project - action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved)

History

#1 Updated by cdywan about 2 months ago

  • Subject changed from Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" to Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M
  • Status changed from New to Workable

#2 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#3 Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress

#4 Updated by openqa_review about 2 months ago

  • Due date set to 2021-12-17

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see whether it helps. Note that this ticket is only about specific jobs so it is expected that not all unexpectedly terminating jobs are gone.

#6 Updated by mkittler about 1 month ago

https://openqa.suse.de/minion/jobs?task=limit_results_and_logs looks good so far. I've also checked on o3, where it looks good as well. (I then cleaned up the Minion dashboard on o3, as there were over 400 failed jobs.)

#7 Updated by mkittler about 1 month ago

Looks like it works: there's already a retried Minion job (with the corresponding note) on OSD: https://openqa.suse.de/minion/jobs?id=3588093

Applying the same approach for other cleanup jobs is easy so I've just created a PR (not strictly part of this ticket): https://github.com/os-autoinst/openQA/pull/4396

#8 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

The other PR has been merged as well and the Minion dashboards still look good.

#9 Updated by okurz about 1 month ago

  • Copied to action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M added

#10 Updated by okurz about 1 month ago

  • Due date deleted (2021-12-17)
