action #102428

closed

Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-07-28
Due date: 2021-12-07
% Done: 0%
Estimated time:
Description

Observation

The fail ratio of multi-machine tests is alerting:

ratio_mm_failed 5.360

Suggestion

  • Investigate what caused the ratio to rise above the alert threshold
  • Maybe the investigation work on #101271 caused a higher fail-ratio

Related issues 2 (0 open, 2 closed)

  • Copied from openQA Project (public) - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M (Resolved, okurz, 2021-07-28, 2021-09-29)
  • Copied to openQA Project (public) - action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M (Resolved, mkittler)

Actions #1

Updated by livdywan about 3 years ago

  • Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Actions #2

Updated by okurz about 3 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by okurz about 3 years ago

The alert turned green again for now. Maybe the investigation work on #101271 caused a higher fail-ratio. Still, this should be investigated in more detail, e.g. by looking into the specific reasons why multi-machine tests failed in the past two days.

Actions #4

Updated by livdywan about 3 years ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.360 to Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 3 years ago

The alert was firing again today (but has already gone off again).

Actions #6

Updated by kraih about 3 years ago

  • Assignee set to kraih
Actions #7

Updated by kraih about 3 years ago

  • Status changed from Workable to In Progress

Looking through the failed jobs.

Actions #8

Updated by openqa_review about 3 years ago

  • Due date set to 2021-12-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by kraih about 3 years ago

So far I'm not seeing any patterns; the reasons why the tests failed are very diverse. The only recent-ish change is that there is more active test development for one of our newer products.

Actions #10

Updated by kraih about 3 years ago

I'll take a closer look at two 24-hour time frames: 2 days ago vs. 30 days ago (when the ratio was around 3).
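
A comparison of two 24-hour windows like the one described above could be sketched as follows. This is only an illustration: it assumes the ratio is the percentage of failed jobs among all multi-machine jobs finished in the window, which the ticket does not spell out. The function `fail_ratio`, the record fields `finished` and `result`, the reference date, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def fail_ratio(jobs, start, end):
    """Percentage of failed jobs among jobs finished in [start, end).

    Assumed definition, not confirmed by the ticket.
    """
    window = [j for j in jobs if start <= j["finished"] < end]
    if not window:
        return 0.0
    failed = sum(1 for j in window if j["result"] == "failed")
    return 100.0 * failed / len(window)


# Arbitrary reference date and invented sample records for illustration.
now = datetime(2021, 11, 25)
jobs = [
    {"finished": now - timedelta(days=2, hours=1), "result": "failed"},
    {"finished": now - timedelta(days=2, hours=2), "result": "passed"},
    {"finished": now - timedelta(days=30, hours=1), "result": "passed"},
    {"finished": now - timedelta(days=30, hours=2), "result": "passed"},
]

# The 24-hour window ending 2 days ago vs. the one ending 30 days ago.
recent = fail_ratio(jobs, now - timedelta(days=3), now - timedelta(days=2))
baseline = fail_ratio(jobs, now - timedelta(days=31), now - timedelta(days=30))
print(f"recent: {recent:.1f}%, baseline: {baseline:.1f}%")
```

With real job records in place of the sample list, the two numbers give a quick indication of whether the ratio actually moved between the two days.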

Actions #11

Updated by kraih about 3 years ago

  • Status changed from In Progress to Feedback

Checked about 200 multi-machine jobs, and the only noticeable change has been new products being tested (with their tests still in active development). So I'm fairly certain there have been no larger underlying issues here, and we should (at least temporarily) consider increasing the alert value.
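
One way to check whether a single product accounts for most of the failures is to break the checked jobs down per group. The sketch below is a minimal illustration, not the method actually used in the investigation; the function `failures_by_group`, the record fields `group` and `result`, and the group names are hypothetical.

```python
from collections import Counter


def failures_by_group(jobs):
    """Return {group: (failed, total)} for the given job records."""
    total = Counter(j["group"] for j in jobs)
    failed = Counter(j["group"] for j in jobs if j["result"] == "failed")
    return {group: (failed[group], total[group]) for group in total}


# Invented sample data, not taken from the actual openQA instance.
jobs = [
    {"group": "established-product", "result": "passed"},
    {"group": "established-product", "result": "passed"},
    {"group": "new-product", "result": "failed"},
    {"group": "new-product", "result": "passed"},
]

breakdown = failures_by_group(jobs)
for group, (failed, total) in sorted(breakdown.items()):
    print(f"{group}: {failed}/{total} failed")
```

If the failures cluster in groups whose tests are still under active development, that supports raising the alert value rather than searching for an infrastructure problem.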

Actions #12

Updated by kraih about 3 years ago

  • Status changed from Feedback to Resolved
Actions #13

Updated by livdywan about 3 years ago

  • Copied to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added