action #91157

[Alerting] web UI: Too many Minion job failures alert: limit_results_and_logs failed

Added by Xiaojing_liu 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-04-15
Due date: 2021-04-30
% Done: 0%
Estimated time:
Difficulty:
Tags:

Description

Observation

The Minion task limit_results_and_logs failed. The job details show:

args: []
attempts: 1
children: []
created: 2021-04-14T22:00:06.22281Z
delayed: 2021-04-14T22:01:06.30156Z
expires: 2021-04-16T22:00:06.22281Z
finished: 2021-04-14T22:42:21.30191Z
id: 1683710
lax: 0
notes:
  gru_id: 28797918
parents: []
priority: 5
queue: default
result: |
  Can't call method "all" on unblessed reference at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobGroups.pm line 262.
retried: 2021-04-14T22:00:06.30156Z
retries: 1
started: 2021-04-14T22:01:08.42185Z
state: failed
task: limit_results_and_logs
time: 2021-04-15T01:51:50.84057Z
worker: 432

See details in https://openqa.suse.de/minion/jobs?state=failed&offset=0
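
For a rough idea of how long the failed attempt ran before hitting the error, the `started` and `finished` timestamps from the job details above can be subtracted. This is only a minimal sketch: the function name `run_time` and the constant `MINION_TS_FORMAT` are made up for illustration, and the format string is an assumption based on how the timestamps appear in this ticket.

```python
from datetime import datetime, timedelta

# Assumed format, inferred from the timestamps shown in this ticket:
# fractional seconds followed by a literal "Z" (UTC).
MINION_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_time(started: str, finished: str) -> timedelta:
    """Return the elapsed time between two timestamps in the format above."""
    start = datetime.strptime(started, MINION_TS_FORMAT)
    end = datetime.strptime(finished, MINION_TS_FORMAT)
    return end - start


# Values copied from the "started" and "finished" fields of job 1683710.
elapsed = run_time("2021-04-14T22:01:08.42185Z", "2021-04-14T22:42:21.30191Z")
print(elapsed)
```

With the values from this job, the attempt ran for roughly 41 minutes before failing.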

History

#1 Updated by mkittler 3 months ago

  • Tags set to alert
  • Category set to Concrete Bugs

#2 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#4 Updated by mkittler 3 months ago

  • Status changed from New to In Progress

#5 Updated by okurz 3 months ago

  • Target version set to Ready

PR merged.

#6 Updated by openqa_review 3 months ago

  • Due date set to 2021-04-30

Setting due date based on mean cycle time of SUSE QE Tools

#7 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

#8 Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress

My fix only reached OSD with today's deployment, so I'm re-triggering the last limit_results_and_logs job now; it is currently still active. I hope it works, because we've already received the filesystem alert as results have been piling up.

I'll also check which group has been configured with an infinite retention period. That configuration triggered the bug, and even with the bug fixed, keeping such a configuration is questionable.
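
The error message ("Can't call method "all" on unblessed reference") means that a method was called on a plain data structure rather than on an object. The actual code in JobGroups.pm is not shown in this ticket, so the following Python sketch only illustrates one plausible shape of this class of bug: a special case for an infinite retention period returns a plain list, while the caller expects an object with an `all` method. All names here (`ResultSet`, `expired_jobs_buggy`, `expired_jobs_fixed`, `age_days`) are hypothetical and not taken from openQA.

```python
import math


class ResultSet:
    """Hypothetical stand-in for a query result object exposing all()."""

    def __init__(self, rows):
        self._rows = list(rows)

    def all(self):
        return list(self._rows)


def expired_jobs_buggy(jobs, keep_days):
    # Special case: with infinite retention nothing ever expires.
    # Bug: returns a plain list, so a caller doing .all() fails.
    if math.isinf(keep_days):
        return []
    return ResultSet(j for j in jobs if j["age_days"] > keep_days)


def expired_jobs_fixed(jobs, keep_days):
    # Same special case, but the return type is consistent.
    if math.isinf(keep_days):
        return ResultSet([])
    return ResultSet(j for j in jobs if j["age_days"] > keep_days)


jobs = [{"id": 1, "age_days": 40}, {"id": 2, "age_days": 5}]

# Finite retention works with both variants.
print(expired_jobs_buggy(jobs, 30).all())

# Infinite retention only works with the fixed variant.
print(expired_jobs_fixed(jobs, math.inf).all())
try:
    expired_jobs_buggy(jobs, math.inf).all()
except AttributeError as err:
    print("buggy variant failed:", err)
```

In this sketch the problem only shows up for the rarely used infinite setting, which would be consistent with a cleanup task failing only once such a group is configured.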

#9 Updated by mkittler 3 months ago

The job group https://openqa.suse.de/admin/job_templates/276 had no (i.e. infinite) retention periods configured for results. I've just set them to match its parent group (see https://openqa.suse.de/admin/job_templates/276), as we cannot keep the results around forever.

Meanwhile, the cleanup is still running but has already brought the results below the alert threshold.

#10 Updated by mkittler 3 months ago

  • Status changed from In Progress to Resolved

The cleanup task succeeded again and we're still below the alert threshold.
