action #102428
closed
Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M
Description
Observation
The Ratio of multi-machine tests is alerting.
ratio_mm_failed 5.360
Suggestion
- Investigate what caused the ratio to turn
- Maybe the investigation work on #101271 caused a higher fail-ratio
Updated by livdywan almost 3 years ago
- Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Updated by okurz almost 3 years ago
The alert turned green again for now. Maybe the investigation work on #101271 caused a higher fail-ratio. Still, this should be investigated in more detail, e.g. by looking at the specific reasons why multi-machine tests failed in the past two days.
Updated by livdywan almost 3 years ago
- Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.360 to Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 3 years ago
The alert was firing again today (but has already gone off again).
Updated by kraih almost 3 years ago
- Status changed from Workable to In Progress
Looking through the failed jobs.
Updated by openqa_review almost 3 years ago
- Due date set to 2021-12-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih almost 3 years ago
So far I'm not seeing any patterns; the reasons why the tests failed are very diverse. The only recent-ish change is that there is more active test development for one of our newer products.
Updated by kraih almost 3 years ago
I'll take a closer look at two 24-hour time frames, 2 days ago vs 30 days ago (when the ratio was around 3).
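A comparison like this could be sketched roughly as follows. This is only an illustration, not the query behind the actual alert: it assumes ratio_mm_failed is the percentage of failed jobs among all multi-machine jobs in a window, and the job record layout (`finished`, `result`) and the helper names `failed_ratio` and `compare_windows` are made up for the example.

```python
from datetime import datetime, timedelta


def failed_ratio(jobs, start, end):
    """Percentage of jobs finished in [start, end) whose result is 'failed'.

    Assumption: each job is a dict with a 'finished' datetime and a
    'result' string. This layout is hypothetical, not the openQA schema.
    """
    window = [j for j in jobs if start <= j["finished"] < end]
    if not window:
        return 0.0  # no jobs in the window, report 0 instead of dividing by zero
    failed = sum(1 for j in window if j["result"] == "failed")
    return 100.0 * failed / len(window)


def compare_windows(jobs, now, recent_days=2, baseline_days=30,
                    length=timedelta(hours=24)):
    """Return (recent, baseline) fail ratios for two windows of equal length."""
    recent_start = now - timedelta(days=recent_days)
    baseline_start = now - timedelta(days=baseline_days)
    recent = failed_ratio(jobs, recent_start, recent_start + length)
    baseline = failed_ratio(jobs, baseline_start, baseline_start + length)
    return recent, baseline
```

Calling `compare_windows(jobs, datetime.now())` with a list of multi-machine job records would give the two ratios side by side, which helps to tell a sudden jump from a gradual drift.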
Updated by kraih almost 3 years ago
- Status changed from In Progress to Feedback
Checked about 200 multi-machine jobs and the only noticeable change has really been new products being tested (and the tests being in active development). So I'm fairly certain there have been no larger underlying issues here, and we should (at least temporarily) consider increasing the alert value.
Updated by livdywan almost 3 years ago
- Copied to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added