coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert - openQA Project (public) - openSUSE Project Management Tool

#1

Updated by mkittler almost 4 years ago

Description updated (diff)

#2

Updated by mkittler almost 4 years ago

Related to action #96197: [alert] web UI: Too many Minion job failures alert size:M added

#3

Updated by okurz almost 4 years ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Target version set to future

#4

Updated by mkittler over 3 years ago

Related to action #70774: save_needle Minion tasks fail frequently and needles could get lost added

#5

Updated by okurz over 3 years ago

Related to action #98499: [alert] web UI: Too many Minion job failures alert size:S added

#6

Updated by okurz over 3 years ago

Since commit 60ba6e1
Author: Amrysliu xiaojing.liu@suse.com
Date: Wed Jan 13 14:40:19 2021 +0800

Ignore specified failed minion jobs

See: https://progress.opensuse.org/issues/70768

it is possible to exclude certain Minion tasks. With this, it should be possible to exclude more tasks purely through configuration.
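
To illustrate the idea of excluding tasks purely through configuration, here is a minimal sketch. It is not the openQA implementation: the setting name `IGNORED_TASKS_SETTING`, its whitespace-separated format, and the function `count_alertable_failures` are assumptions made for illustration. Only the task names come from the related tickets.

```python
from collections import Counter

# Hypothetical configuration value: whitespace-separated task names whose
# failures should not count towards the alert.
IGNORED_TASKS_SETTING = "obs_rsync_run obs_rsync_update_builds_text"


def count_alertable_failures(failed_jobs, ignored_setting=IGNORED_TASKS_SETTING):
    """Count failed jobs per task, skipping tasks listed in the setting."""
    ignored = set(ignored_setting.split())
    return Counter(job["task"] for job in failed_jobs if job["task"] not in ignored)


# Illustrative input; in practice this would come from the Minion job list.
failed_jobs = [
    {"id": 1, "task": "obs_rsync_run"},
    {"id": 2, "task": "save_needle"},
    {"id": 3, "task": "obs_rsync_update_builds_text"},
    {"id": 4, "task": "save_needle"},
]

counts = count_alertable_failures(failed_jobs)
print(sum(counts.values()))  # number of failures that would feed the alert
```

With a list like this, excluding another task would only require extending the setting, with no code change.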

#7

Updated by mkittler over 3 years ago

Related to action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently added

#8

Updated by mkittler over 3 years ago

Description updated (diff)

#9

Updated by okurz over 3 years ago

Target version changed from future to Ready

After another OSD deployment, another big chunk of Minion jobs failed with "Job terminated unexpectedly (exit code: 0, signal: 15)", which we should handle in a better way. Adding the ticket to the backlog.

#10

Updated by okurz over 3 years ago

Priority changed from Normal to High

#11

Updated by okurz over 3 years ago

#97979 was another place where we found that our Minion job alert can be improved, hence bumping the priority to "High".

#12

Updated by okurz over 3 years ago

Status changed from New to In Progress
Assignee set to okurz

We tried to estimate, but maybe I can first help clarify what we want to achieve.

#13

Updated by openqa_review over 3 years ago

Due date set to 2021-10-15

Setting due date based on mean cycle time of SUSE QE Tools

#14

Updated by mkittler over 3 years ago

There are multiple points to be addressed. Maybe this should have been an epic. However, I agree with okurz's previous comment about getting started (#96263#note-9).

By the way, we're generating the data for InfluxDB via the openQA web application, not via direct SQL queries. I recently checked the concrete code and the Minion API we use so far. There seemed to be no easy way to do a filtered query via the Mojo API, so one would have to loop through all failed jobs to filter out those with the result "Job terminated unexpectedly". Maybe it is simpler and more performant to query Minion's database tables directly via PostgreSQL, as we already do for other monitoring data.
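
The two options above can be sketched roughly as follows. This is an illustration under assumptions, not code from openQA: the job dictionaries, the function `split_failed_jobs`, and the constant names are hypothetical, and the SQL string assumes that Minion's PostgreSQL backend keeps jobs in a `minion_jobs` table with `state` and `result` columns.

```python
TERMINATED_MARKER = "Job terminated unexpectedly"


def split_failed_jobs(failed_jobs, marker=TERMINATED_MARKER):
    """Option 1: loop over all failed jobs and separate those whose result
    mentions the marker from all other failures."""
    terminated, other = [], []
    for job in failed_jobs:
        # A result may be a string or structured data; compare its text form.
        result_text = str(job.get("result", ""))
        (terminated if marker in result_text else other).append(job)
    return terminated, other


# Option 2 (assumed schema, not verified against a live database): let
# PostgreSQL do the filtering instead of looping in the web application.
FILTERED_COUNT_SQL = """
SELECT count(*) FROM minion_jobs
WHERE state = 'failed'
  AND result::text NOT LIKE '%Job terminated unexpectedly%'
"""

# Illustrative input shaped like a list of failed jobs.
failed_jobs = [
    {"id": 1, "task": "save_needle",
     "result": "Job terminated unexpectedly (exit code: 0, signal: 15)"},
    {"id": 2, "task": "save_needle", "result": "some other error"},
    {"id": 3, "task": "obs_rsync_run", "result": {"error": "rsync failed"}},
]

terminated, other = split_failed_jobs(failed_jobs)
print(len(terminated), len(other))
```

The loop keeps everything inside the web application but touches every failed job, whereas the query moves the filtering into the database at the cost of depending on Minion's table layout.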

#15

Updated by okurz over 3 years ago

Tracker changed from action to coordination
Subject changed from Exclude certain Minion tasks from "Too many Minion job failures alert" alert to [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
Description updated (diff)

#16

Updated by okurz over 3 years ago

Description updated (diff)

#17

Updated by okurz over 3 years ago

Description updated (diff)

#18

Updated by openqa_review over 3 years ago

Setting due date based on mean cycle time of SUSE QE Tools

#19

Updated by okurz over 3 years ago

Description updated (diff)
Status changed from In Progress to Blocked

Created four subtasks. One is currently in our backlog. Setting to "Blocked" on the subtasks.

#20

Updated by okurz over 3 years ago

Status changed from Blocked to New
Assignee deleted (~~okurz~~)
Target version changed from Ready to future

Currently not to be considered for the backlog due to backlog size constraints.

#21

Updated by okurz over 3 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Feature requests
Target version changed from future to Ready
Parent task set to #80142

#22

Updated by okurz over 3 years ago

Status changed from New to Blocked
Assignee set to okurz

#23

Updated by mkittler almost 3 years ago

Description updated (diff)

I added two more points to the ticket description (see #112193).

#24

Updated by okurz almost 3 years ago

Status changed from Blocked to New
Assignee deleted (~~okurz~~)
Target version changed from Ready to future

#25

Updated by jbaier_cz over 2 years ago

Related to action #118969: [alert] web UI: Too many Minion job failures alert added
