action #102428

Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M

Added by cdywan 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-07-28
Due date: 2021-12-07
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

The alert for the ratio of failed multi-machine tests is firing.

ratio_mm_failed 5.360

Suggestion

  • Investigate what caused the ratio to rise and trigger the alert
  • Maybe the investigation work on #101271 caused a higher fail-ratio

Related issues

Copied from openQA Project - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M (Resolved, 2021-07-28, 2021-09-29)

Copied to openQA Project - action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M (Resolved)

History

#1 Updated by cdywan 2 months ago

  • Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added

#2 Updated by okurz 2 months ago

  • Priority changed from Normal to Urgent

#3 Updated by okurz 2 months ago

The alert turned green again for now. Maybe the investigation work on #101271 caused a higher fail-ratio. Still, this should be investigated in more detail, e.g. by looking at the specific reasons why multi-machine tests failed in the past two days.

#4 Updated by cdywan 2 months ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.360 to Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by mkittler 2 months ago

The alert was firing again today (but has already gone off again).

#6 Updated by kraih 2 months ago

  • Assignee set to kraih

#7 Updated by kraih 2 months ago

  • Status changed from Workable to In Progress

Looking through the failed jobs.

#8 Updated by openqa_review 2 months ago

  • Due date set to 2021-12-07

Setting due date based on mean cycle time of SUSE QE Tools

#9 Updated by kraih 2 months ago

So far I'm not seeing any patterns; the reasons why the tests failed are very diverse. The only recent-ish change is that there is more active test development for one of our newer products.

#10 Updated by kraih 2 months ago

I'll take a closer look at two 24-hour time frames, 2 days ago vs. 30 days ago (when the ratio was around 3).

#11 Updated by kraih 2 months ago

  • Status changed from In Progress to Feedback

Checked about 200 multi-machine jobs, and the only noticeable change has really been new products being tested (and the tests being in active development). So I'm fairly certain there have been no larger underlying issues here, and we should (at least temporarily) consider increasing the alert value.
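
For reference, a comparison of two 24-hour windows like the one described above could be sketched roughly as follows. This is only an illustration: it assumes the ratio is the percentage of failed jobs among all multi-machine jobs finished in a window, and the job record layout (`finished`, `result`) and the function names are hypothetical, not taken from openQA or the Grafana dashboard.

```python
from datetime import datetime, timedelta


def fail_ratio(jobs, start, end):
    """Percentage of failed jobs among jobs finished in [start, end).

    `jobs` is a list of dicts with hypothetical keys:
    "finished" (datetime) and "result" (str, e.g. "passed" or "failed").
    """
    window = [j for j in jobs if start <= j["finished"] < end]
    if not window:
        return 0.0  # no jobs in the window, so nothing failed
    failed = sum(1 for j in window if j["result"] == "failed")
    return 100.0 * failed / len(window)


def compare_windows(jobs, now, recent_days_ago=2, baseline_days_ago=30):
    """Return (recent, baseline) fail ratios for two 24-hour windows."""
    day = timedelta(hours=24)
    recent_start = now - timedelta(days=recent_days_ago)
    baseline_start = now - timedelta(days=baseline_days_ago)
    recent = fail_ratio(jobs, recent_start, recent_start + day)
    baseline = fail_ratio(jobs, baseline_start, baseline_start + day)
    return recent, baseline
```

Comparing the two returned values gives a quick indication of whether the fail ratio has drifted between the windows. It does not explain why, which still requires looking through the individual failed jobs.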

#12 Updated by kraih 2 months ago

  • Status changed from Feedback to Resolved

#13 Updated by cdywan about 2 months ago

  • Copied to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added
