[Alerting] web UI: Too many Minion job failures alert: limit_results_and_logs failed
The limit_results_and_logs task failed with the following message:
args:
attempts: 1
children:
created: 2021-04-14T22:00:06.22281Z
delayed: 2021-04-14T22:01:06.30156Z
expires: 2021-04-16T22:00:06.22281Z
finished: 2021-04-14T22:42:21.30191Z
id: 1683710
lax: 0
notes:
  gru_id: 28797918
parents:
priority: 5
queue: default
result: |
  Can't call method "all" on unblessed reference at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobGroups.pm line 262.
retried: 2021-04-14T22:00:06.30156Z
retries: 1
started: 2021-04-14T22:01:08.42185Z
state: failed
task: limit_results_and_logs
time: 2021-04-15T01:51:50.84057Z
worker: 432
See details in https://openqa.suse.de/minion/jobs?state=failed&offset=0
- Status changed from Feedback to In Progress
My fix only landed on OSD with today's deployment, so I'm re-triggering the last limit_results_and_logs job now. It is still active. I hope it works, because results have been piling up and we've already received the filesystem alert.
I'll also check which group has been configured with an infinite retention period. That configuration triggered the bug, and even with the bug fixed it is questionable to have such a configuration.
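The error message suggests that a plain reference was returned where an object with an `all` method was expected. The code in JobGroups.pm is not shown in this ticket, so the following Python sketch only illustrates that general class of bug and one way to guard against it. `FakeResultSet`, `expired_jobs_buggy`, `expired_jobs_fixed`, and the use of `None` for an infinite retention period are all hypothetical and not taken from openQA.

```python
from typing import List, Optional


class FakeResultSet:
    """Hypothetical stand-in for a database result set with an all() method."""

    def __init__(self, rows: List[int]):
        self._rows = rows

    def all(self) -> List[int]:
        return list(self._rows)


def expired_jobs_buggy(job_ages: List[int], keep_days: Optional[int]):
    # Assumption: None models an infinite retention period.
    if keep_days is None:
        # Returns a plain list instead of a result set,
        # so any caller that invokes .all() on the result fails.
        return []
    return FakeResultSet([age for age in job_ages if age > keep_days])


def expired_jobs_fixed(job_ages: List[int], keep_days: Optional[int]) -> FakeResultSet:
    # Always return the same type, even when nothing can ever expire.
    if keep_days is None:
        return FakeResultSet([])
    return FakeResultSet([age for age in job_ages if age > keep_days])
```

With the guarded variant, a caller can invoke `.all()` regardless of how the retention period is configured.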
The job group https://openqa.suse.de/admin/job_templates/276 had no retention period configured for results, which means the results were kept indefinitely. I've just set it to match its parent group (see https://openqa.suse.de/admin/job_templates/276), as we cannot keep the results around forever.
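A check like the one described above could be sketched as follows. This is only an illustration over in-memory records: the `Group` class, its field name `keep_results_in_days`, and the convention that `None` or `0` means "keep forever" are assumptions, not openQA's actual schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    id: int
    # Assumption: None or 0 means the results are kept forever.
    keep_results_in_days: Optional[int]


def groups_without_finite_retention(groups: List[Group]) -> List[int]:
    """Return the ids of groups that have no finite result retention."""
    return [g.id for g in groups if not g.keep_results_in_days]
```

Any group id returned by such a check would be a candidate for receiving an explicit retention period.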
The cleanup is still running but has already brought the results below the alert threshold.