coordination #96263

Updated by mkittler 4 months ago

There are certain Minion tasks which fail regularly but there's not much we can do about it.

* All OBS rsync related tasks: Such failing tasks seem to have no impact; apparently it is sufficient if it works on the next attempt or users are able to fix problems themselves.
* `save_needle` tasks: The needles dir on OSD might just be misconfigured by the user, e.g. we recently had lots of failing jobs for a needles directory for a new distribution. The users were often able to fix the problem themselves after they see the error (which is directly visible when saving a needle).
* `finalize_job_results` tasks: I am not sure about this one. If the job just fails due to a user-provided hook script it should not be our business. On the other hand, we also configure the hook script for the investigation ourselves and want to be informed about problems. (So likely we want to consider these failures after all.)


Instead of implementing this on the monitoring-level we could also change openQA's behavior so these jobs are not considered failing. This would allow for a finer distinction (e.g. jobs would still fail if there's an unhanded exception due to a real regression). The disadvantage would be that all openQA instances would be affected. However, that's maybe not a bad thing and we can make it always configurable.