coordination #96263

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Exclude certain Minion tasks from the "Too many Minion job failures" alert

Added by mkittler almost 3 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2020-09-01
Due date: -
% Done: 38%
Estimated time: (Total: 0.00 h)

Description

There are certain Minion tasks which fail regularly, but there is not much we can do about it:

  • #99837 All OBS rsync related tasks: Such failing tasks seem to have no impact; apparently it is sufficient if the task succeeds on the next attempt, or users are able to fix problems themselves. (A sketch of excluding such tasks from the failure count follows this list.)
  • #70774 save_needle tasks: The needles dir on OSD might just be misconfigured by the user, e.g. we recently had lots of failing jobs for the needles directory of a new distribution. The users were often able to fix the problem themselves after seeing the error (which is directly visible when saving a needle).
  • #100503 finalize_job_results tasks: I am not sure about this one. If the job fails merely due to a user-provided hook script, it should not be our business. On the other hand, we also configure the hook script for the investigation ourselves and want to be informed about problems, so we likely want to consider these failures after all.
  • #99831 jobs failing with the result 'Job terminated unexpectedly (exit code: 0, signal: 15)': This problem is independent of the task but is of course seen much more often on tasks we spawn many jobs of (e.g. finalize_job_results tasks) and tasks with possibly long-running jobs (e.g. limit_assets). I suppose the error just means the Minion worker was restarted, as signal 15 is SIGTERM. Since such tasks are either not very important or triggered periodically, we can likely just ignore those failed jobs.
  • jobs failing with the result 'Worker went away': This problem is independent of the task and similar to the previous point. I am not sure under which conditions jobs with that result are produced, but apparently our current handling for SIGTERM (from the previous point) does not cover it.
  • jobs failing with a PostgreSQL connection error: This problem is also independent of the task, as PostgreSQL is used basically everywhere. I suppose such failures happen when the database is restarted or briefly unavailable for some reason.
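
To make the exclusion idea concrete, here is a minimal sketch (in Perl, like openQA and Minion) of counting failed Minion jobs for the /influxdb/minion metrics while skipping a blocklist of task names. The helper name and the blocklist contents are assumptions for illustration, not openQA's actual code; only Minion's documented jobs() iterator is used:

    use Mojo::Base -strict;

    # Count failed Minion jobs, ignoring tasks on a blocklist, so that only
    # the remaining failures would feed the "Too many Minion job failures"
    # alert. Hypothetical helper, not openQA's implementation.
    sub count_relevant_failures {
        my ($minion, @excluded_tasks) = @_;
        my %excluded = map { ($_ => 1) } @excluded_tasks;
        my $failed   = 0;

        # Minion's jobs() returns an iterator over jobs matching the filter
        my $jobs = $minion->jobs({states => ['failed']});
        while (my $info = $jobs->next) {
            next if $excluded{$info->{task}};    # skip blocklisted tasks
            ++$failed;
        }
        return $failed;
    }

    # e.g. (task names assumed): my $failed = count_relevant_failures(
    #     $app->minion, qw(obs_rsync_run obs_rsync_update_builds_text save_needle));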

Instead of implementing this on the monitoring level we could also change openQA's behavior so these jobs are not considered failing. This would allow for a finer distinction (e.g. jobs would still fail if there is an unhandled exception due to a real regression). The disadvantage would be that all openQA instances would be affected. However, that is maybe not a bad thing, and we can always make it configurable.
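
As a rough illustration of that approach, a task could handle SIGTERM itself and re-schedule instead of failing. This is only a sketch under the assumption that retrying is acceptable for the task; the task name and DSN are made up, and it is not openQA's actual code:

    use Mojolicious::Lite -signatures;

    plugin Minion => {Pg => 'postgresql://openqa@/openqa'};    # assumed DSN

    app->minion->add_task(some_long_running_task => sub ($job, @args) {
        # If the worker sends SIGTERM during a restart, retry later instead
        # of letting the job end up as "Job terminated unexpectedly".
        local $SIG{TERM} = sub {
            $job->retry({delay => 60});
            exit 0;
        };
        # ... the actual long-running work would happen here ...
    });

    app->start;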


Subtasks (10: 6 open, 4 closed)

action #70774: save_needle Minion tasks fail frequently (New, 2020-09-01)
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly" (New, 2021-12-02)
action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M (Resolved, mkittler, 2021-12-02)
action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M (Resolved, okurz)
action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M (Resolved, mkittler)
action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download (New, 2022-03-25)
action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule (New, 2022-03-25)
action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle (New, 2022-03-25)
action #99837: configurable exclusion rules for /influxdb/minion (New, 2021-10-06)
action #100503: Identify all "finalize_job_results" failures and handle them (report ticket or fix) (Resolved, livdywan, 2021-10-07)

Related issues (4: 0 open, 4 closed)

Related to openQA Project - action #96197: [alert] web UI: Too many Minion job failures alert size:M (Resolved, mkittler, 2021-07-28)
Related to openQA Infrastructure - action #98499: [alert] web UI: Too many Minion job failures alert size:S (Resolved, mkittler, 2021-09-13)
Related to openQA Infrastructure - action #70768: obs_rsync_run and obs_rsync_update_builds_text Minion tasks fail frequently (Resolved, Xiaojing_liu, 2020-09-01)
Related to openQA Project - action #118969: [alert] web UI: Too many Minion job failures alert (Resolved, mkittler, 2022-10-17)