action #100503 (closed)

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

Identify all "finalize_job_results" failures and handle them (report ticket or fix)

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: livdywan
Category: -
Target version: -
Start date: 2021-10-07
Due date: -
% Done: 0%
Estimated time: -

Description

Motivation

We have failed Minion jobs and related alerts. If a job failed just because of a user-provided hook script, that would not be our business, but so far we do not have any such user-provided hooks on OSD. Otherwise we would know, because they need to be configured over salt. Mostly we configure the hook script for the investigation ourselves and want to be informed about problems with it, so we want to consider these failures after all.
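
For context, the hook scripts in question are read from the [hooks] section of openqa.ini, which salt manages on OSD. Below is a minimal sketch for checking what is configured on a host; the config path, the section name and the typical key names are assumptions about the local setup:

    #!/usr/bin/env python3
    """List the job-done hook scripts configured for an openQA instance.

    Assumes hooks are configured in the [hooks] section of
    /etc/openqa/openqa.ini (keys typically look like job_done_hook_failed or
    job_done_hook_incomplete); both the path and the section name are
    assumptions about the local deployment.
    """

    import configparser

    # Disable interpolation so '%' characters in hook commands do not break parsing.
    config = configparser.ConfigParser(interpolation=None)
    config.read("/etc/openqa/openqa.ini")

    if not config.has_section("hooks"):
        print("no job-done hooks configured")
    else:
        for key, script in config["hooks"].items():
            print(f"{key} -> {script}")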

Acceptance criteria

  • AC1: All recent "finalize_job_results" failures are investigated and handled accordingly (ticket reported or fixed)

Suggestions
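
  • Failed "finalize_job_results" jobs can be reviewed in the Minion dashboard of the web UI. For bulk triage across osd and o3, something along the lines of the sketch below could list them directly from the database; it is a rough sketch assuming Minion's default PostgreSQL backend (a minion_jobs table with task, state, result and finished columns) and placeholder connection settings.

    #!/usr/bin/env python3
    """List failed finalize_job_results Minion jobs from the openQA database.

    The connection parameters and the minion_jobs schema are assumptions based
    on Minion's default PostgreSQL backend; adjust them for the actual
    deployment.
    """

    import json

    import psycopg2

    # Placeholder connection settings; use the instance's real credentials.
    conn = psycopg2.connect(dbname="openqa", host="localhost", user="geekotest")

    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, args, result, finished
            FROM minion_jobs
            WHERE task = 'finalize_job_results' AND state = 'failed'
            ORDER BY finished DESC
            """
        )
        for job_id, args, result, finished in cur.fetchall():
            # args and result are JSONB columns; psycopg2 decodes them to
            # Python objects, so just summarize them for a quick overview.
            print(f"minion job {job_id} finished {finished}")
            print(f"  args:   {json.dumps(args, default=str)}")
            print(f"  result: {json.dumps(result, default=str)}")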

Out of scope

  • result: 'Job terminated unexpectedly (exit code: 0, signal: 15)' (#99831)
Actions #1

Updated by livdywan about 3 years ago

We discussed it briefly. It is not clear what the goal here is. If we want to catch failures in hook scripts, we need tickets about failures. And if we're not seeing alerts we need to look into that. We're also not sure why this is "High" when e.g. #99741, which is about minion failures that show as green, is not. Also, it's unclear what user jobs refers to here.

Actions #2

Updated by okurz about 3 years ago

livdywan wrote:

We discussed it briefly. It is not clear what the goal here is.

Hm, the goal should have been clear from AC1: "All recent "finalize_job_results" failures are investigated and handled accordingly (ticket reported or fixed)". But let me try to rephrase in case it's not clear: there should be no failing minion jobs of type "finalize_job_results" found on osd and o3 where the reason of failure is not already known and handled in other tickets.

If we want to catch failures in hook scripts, we need tickets about failures.

Well, if we currently don't have any such failures, then this ticket is not that urgent.

And if we're not seeing alerts we need to look into that.

Yes, but I am not aware of any specific problems along the lines of "jobs fail but no alerts".

We're also not sure why this is "High" when e.g. #99741, which is about minion failures that show as green, is not.

It's mostly high because I thought you and tina were already working on related tasks. I'm not sure which ticket you handle those hook job failures in; maybe still the o3 Leap 15.3 upgrade ticket?

Regarding #99741: there you asked specifically for alerts on o3, which I see as mostly blocked by #57239, hence #99741 is not in the backlog.

Also, it's unclear what user jobs refers to here.

What "user jobs"?

Actions #3

Updated by tinita about 3 years ago

okurz wrote:

It's mostly high because I thought you and tina were already working on related tasks. I'm not sure which ticket you handle those hook job failures in; maybe still the o3 Leap 15.3 upgrade ticket?

#99519

Regarding #99741: there you asked specifically for alerts on o3, which I see as mostly blocked by #57239, hence #99741 is not in the backlog.

Why is #57239 blocking #99741?

#99519 is about missing comments. I investigated the problem, and one finding was that we don't catch failures in hook scripts as actual failures.

#99741 is a follow-up to that and to the other failure regarding /bin/sh.

Also, it's unclear what user jobs refers to here.

What "user jobs"?

The ticket description mentions "user-provided hook script".

Actions #4

Updated by livdywan about 3 years ago

  • Description updated (diff)
Actions #5

Updated by livdywan about 3 years ago

  • Description updated (diff)
Actions #6

Updated by livdywan about 3 years ago

  • Status changed from New to Resolved
  • Assignee set to livdywan

There are currently only two failed jobs, both failing because of #99831, and no other ones. If we do see failures after all, we'll get alerts and handle those, as we don't want to ignore these in general.
