coordination #96263

[epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

Added by mkittler 3 months ago. Updated 15 days ago.

Status: New
Priority: Normal
Assignee: -
Target version:
Start date: 2020-09-01
Due date:
% Done: 25%
Estimated time: (Total: 0.00 h)

Description

Certain Minion tasks fail regularly, but there is not much we can do about these failures.

  • #99837 All OBS rsync related tasks: Such failing tasks seem to have no impact; apparently it is sufficient if it works on the next attempt or users are able to fix problems themselves.
  • #70774 save_needle tasks: The needles dir on OSD might just be misconfigured by the user, e.g. we recently had lots of failing jobs for the needles directory of a new distribution. The users were often able to fix the problem themselves after seeing the error (which is directly visible when saving a needle).
  • #100503 finalize_job_results tasks: I am not sure about this one. If the job just fails due to a user-provided hook script it should not be our business. On the other hand, we also configure the hook script for the investigation ourselves and want to be informed about problems. (So likely we want to consider these failures after all.)
  • #99831 jobs failing with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)': This problem is independent of the task but is of course seen much more often on tasks we spawn many jobs of (e.g. finalize_job_results tasks) and tasks with possibly long-running jobs (e.g. limit_assets). I suppose the error just means the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we can likely just ignore those failed jobs.

Instead of implementing this on the monitoring level we could also change openQA's behavior so these jobs are not considered failing. This would allow for a finer distinction (e.g. jobs would still fail if there's an unhandled exception due to a real regression). The disadvantage would be that all openQA instances would be affected. However, that's maybe not a bad thing, and we can always make it configurable.
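The exclusion rules discussed in this description can be sketched as a small predicate. This is only an illustration in Python, not openQA code: the job dictionaries, the `IGNORED_TASKS` set and the function names are hypothetical, and only the task names and the result string are taken from this ticket. finalize_job_results is deliberately not ignored, following the note above that we likely want to consider those failures.

```python
# Hypothetical sketch: decide which failed Minion jobs should count
# towards the "Too many Minion job failures" alert.

# Result string quoted in this ticket (signal 15 is SIGTERM).
SIGTERM_RESULT = "Job terminated unexpectedly (exit code: 0, signal: 15)"

# Assumed ignore list; the task names come from the ticket, the set itself
# is made up for illustration.
IGNORED_TASKS = {"obs_rsync_run", "obs_rsync_update_builds_text", "save_needle"}


def is_alert_worthy(job):
    """Return True if a job (a plain dict here) should count as a failure."""
    if job["state"] != "failed":
        return False
    if job["task"] in IGNORED_TASKS:
        return False
    # Independent of the task: the worker was most likely just restarted.
    if job.get("result") == SIGTERM_RESULT:
        return False
    return True


def count_alert_worthy(jobs):
    """Count the failed jobs that remain after applying the exclusions."""
    return sum(1 for job in jobs if is_alert_worthy(job))
```

With rules like these, a failing finalize_job_results job would still be counted, while a failing obs_rsync_run job or any job killed by SIGTERM would not.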


Subtasks

openQA Project - action #70774: save_needle Minion tasks fail frequently (New)

action #99831: Better handle minion tasks failing with "Job terminated unexpectedly" (New)

action #99837: configurable exclusion rules for /influxdb/minion (New)

action #100503: Identify all "finalize_job_results" failures and handle them (report ticket or fix) (Resolved, cdywan)


Related issues

Related to openQA Project - action #96197: [alert] web UI: Too many Minion job failures alert size:M (Resolved, 2021-07-28)

Related to openQA Infrastructure - action #98499: [alert] web UI: Too many Minion job failures alert size:S (Resolved, 2021-09-13)

Related to openQA Infrastructure - action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently (Resolved, 2020-09-01)

History

#1 Updated by mkittler 3 months ago

  • Description updated (diff)

#2 Updated by mkittler 3 months ago

  • Related to action #96197: [alert] web UI: Too many Minion job failures alert size:M added

#3 Updated by okurz 3 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Target version set to future

#4 Updated by mkittler about 1 month ago

  • Related to action #70774: save_needle Minion tasks fail frequently added

#5 Updated by okurz about 1 month ago

  • Related to action #98499: [alert] web UI: Too many Minion job failures alert size:S added

#6 Updated by okurz about 1 month ago

Since commit 60ba6e1
Author: Amrysliu xiaojing.liu@suse.com
Date: Wed Jan 13 14:40:19 2021 +0800

Ignore specified failed minion jobs

See: https://progress.opensuse.org/issues/70768

it is possible to exclude certain Minion tasks. With this it should be possible to exclude more tasks purely through configuration.
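To illustrate what configuration-based exclusion could look like, here is a minimal Python sketch that reads a whitespace-separated list of task names from an INI snippet. The section name `influxdb` and the key `ignored_failed_minion_jobs` are assumptions made for this example and should be checked against the referenced commit; the helper `load_ignored_tasks` is hypothetical.

```python
import configparser

# Assumed config snippet; section and key names are not confirmed by this ticket.
EXAMPLE_INI = """
[influxdb]
ignored_failed_minion_jobs = obs_rsync_run obs_rsync_update_builds_text
"""


def load_ignored_tasks(ini_text, section="influxdb",
                       key="ignored_failed_minion_jobs"):
    """Parse the ignore list; return an empty set if nothing is configured."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get(section, key, fallback="")
    return set(raw.split())
```

Excluding a further task would then only require appending its name to the configured list, without any code change.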

#7 Updated by mkittler about 1 month ago

  • Related to action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently added

#8 Updated by mkittler about 1 month ago

  • Description updated (diff)

#9 Updated by okurz about 1 month ago

  • Target version changed from future to Ready

After another OSD deployment, another big chunk of Minion jobs failed with "Job terminated unexpectedly (exit code: 0, signal: 15)", which we should handle in a better way. Adding the ticket to the backlog.

#10 Updated by okurz 28 days ago

  • Priority changed from Normal to High

#11 Updated by okurz 28 days ago

#97979 was another place where we found that our minion job alert can be improved hence bumping prio to "High".

#12 Updated by okurz 27 days ago

  • Status changed from New to In Progress
  • Assignee set to okurz

We tried to estimate, but maybe I can first help to clarify what we want to achieve.

#13 Updated by openqa_review 26 days ago

  • Due date set to 2021-10-15

Setting due date based on mean cycle time of SUSE QE Tools

#14 Updated by mkittler 26 days ago

There are multiple points to be addressed. Maybe this should have been an epic. However, I agree with okurz's previous comment about getting started (#96263#note-9).

By the way, we're generating the data for InfluxDB via the openQA web application. This is not using direct SQL queries. I've recently checked the concrete code and the Minion API we use so far. There seemed to be no easy way to do a filtered query via the Mojo API, so one would have to loop through all failed jobs to filter out jobs with the result "Job terminated unexpectedly". Maybe it is simpler and more performant to query Minion's database tables directly via PostgreSQL like we already do for other monitoring data.
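A rough sketch of the direct-query idea follows. It uses an in-memory SQLite table as a stand-in so that the example is runnable; the real data lives in PostgreSQL. The table name `minion_jobs` and the columns `task`, `state` and `result` are assumptions based on Minion's PostgreSQL backend, where `result` is stored as JSON rather than plain text, so a real query would need adjusting. The helper names are hypothetical.

```python
import sqlite3

SIGTERM_RESULT = "Job terminated unexpectedly (exit code: 0, signal: 15)"


def make_demo_db():
    """Create a stand-in table; the schema is a simplified assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE minion_jobs (task TEXT, state TEXT, result TEXT)")
    conn.executemany(
        "INSERT INTO minion_jobs VALUES (?, ?, ?)",
        [
            ("obs_rsync_run", "failed", "rsync error"),
            ("limit_assets", "failed", SIGTERM_RESULT),
            ("finalize_job_results", "failed", "hook script failed"),
            ("save_needle", "finished", None),
        ],
    )
    return conn


def count_failed(conn, ignored_tasks, ignored_result):
    """Count failed jobs in one query instead of looping over all of them."""
    sql = ("SELECT COUNT(*) FROM minion_jobs "
           "WHERE state = 'failed' AND COALESCE(result, '') != ?")
    params = [ignored_result]
    if ignored_tasks:  # avoid an empty IN () list, which PostgreSQL rejects
        placeholders = ", ".join("?" for _ in ignored_tasks)
        sql += f" AND task NOT IN ({placeholders})"
        params.extend(sorted(ignored_tasks))
    return conn.execute(sql, params).fetchone()[0]
```

The filtering happens inside the database, so the web application would not need to fetch every failed job just to discard most of them.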

#15 Updated by okurz 21 days ago

  • Tracker changed from action to coordination
  • Subject changed from Exclude certain Minion tasks from "Too many Minion job failures alert" alert to [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
  • Description updated (diff)

#16 Updated by okurz 21 days ago

  • Description updated (diff)

#17 Updated by okurz 21 days ago

  • Description updated (diff)

#18 Updated by openqa_review 20 days ago

Setting due date based on mean cycle time of SUSE QE Tools

#19 Updated by okurz 20 days ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked

Created four subtasks. One is currently in our backlog. Setting to "Blocked" on the subtasks.

#20 Updated by okurz 15 days ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future

Currently not to be considered for the backlog due to backlog size constraints.
