action #103425

closed

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M

Added by livdywan almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

The "Ratio of multi-machine tests" alert was firing between 10.34 and 15.10(?) CET today:

ratio_mm_failed 5.280

Acceptance criteria

  • AC1: Thresholds for ratio_mm_failed are tuned based on concrete data
  • AC2: Investigation advice in alert contains concrete steps

Suggestion
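
One way to approach AC1 could be to derive the threshold from recorded values instead of picking it by hand. The sketch below is only an illustration under assumptions: `suggest_threshold` and its `margin` parameter are hypothetical and not part of openQA or salt-states-openqa, and the sample list contains only the `ratio_mm_failed` values quoted in this ticket and in #102428 at the time the alert fired. A real tuning would likely need the full historical series from the monitoring database, not just the alerting values.

```python
# Values of ratio_mm_failed quoted in this ticket and in #102428
# at the moments the alert fired (not a complete history).
observed_alert_values = [5.280, 5.360, 5.910, 5.080]


def suggest_threshold(samples, margin=0.2):
    """Return a threshold that lies `margin` (a fraction) above the highest sample.

    Hypothetical helper for illustration only; the margin of 20% is an
    arbitrary assumption, not a value taken from the actual alert config.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return round(max(samples) * (1 + margin), 3)


if __name__ == "__main__":
    print(suggest_threshold(observed_alert_values))
```

With a threshold chosen this way, none of the values listed above would have triggered the alert, so the margin directly controls how noisy the alert is.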


Related issues 3 (0 open, 3 closed)

Related to openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M (Resolved, mkittler)
Related to openQA Project - action #71809: Enable multi-machine jobs trigger without "isos post" (Resolved, mkittler, 2020-09-24)
Copied from openQA Project - action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M (Resolved, kraih, 2021-07-28, 2021-12-07)

Actions #1

Updated by livdywan almost 3 years ago

  • Copied from action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M added
Actions #2

Updated by livdywan almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by livdywan almost 3 years ago

And alerting again right now:

ratio_mm_failed 5.910
Actions #4

Updated by livdywan almost 3 years ago

cdywan wrote:

And alerting again right now:

ratio_mm_failed 5.910

It's OK again

Actions #5

Updated by livdywan almost 3 years ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 to Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz almost 3 years ago

  • Due date deleted (2021-12-07)
  • Start date deleted (2021-07-28)

I think you copied the due date from the original ticket, so I am removing it here and resetting the start date as well. By the way, what do you want to say with the subject "Provide … alerting"?

Actions #7

Updated by livdywan almost 3 years ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M to Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M

okurz wrote:

I think you copied the due date from the original ticket, so I am removing it here and resetting the start date as well. By the way, what do you want to say with the subject "Provide … alerting"?

The subject 'Monitor "fail-ratio" of tests' became 'Provide "fail-rate" of tests', which became this 😉️

Actions #8

Updated by livdywan almost 3 years ago

  • Priority changed from Normal to High

Alerting now:

ratio_mm_failed 5.080

I think it's safe to say we've reached alert fatigue. And we're not even clear on what caused or resolved the previous occurrences. Hence raising the priority.

Actions #9

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler

Paused the alert for now

Actions #10

Updated by okurz almost 3 years ago

  • Related to action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M added
Actions #11

Updated by okurz almost 3 years ago

  • Related to action #71809: Enable multi-machine jobs trigger without "isos post" added
Actions #12

Updated by mkittler almost 3 years ago

  • Status changed from Workable to Feedback
Actions #13

Updated by okurz almost 3 years ago

  • Parent task set to #103962
Actions #14

Updated by okurz almost 3 years ago

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/625, which adds some investigation hints, taken from this ticket's description, to the alert notification message.

Actions #15

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/622 has been merged. Rather than removing the alert completely, it keeps the alert but uses a very high threshold. This differs a bit from AC1, but it seems to be the best we can currently do.

I've also just merged the SR from @okurz. This should cover AC2.
