action #103425

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M

Added by cdywan about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Concrete Bugs
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

The Ratio of multi-machine tests was alerting between 10.34 and 15.10(?) CET today:

ratio_mm_failed 5.280

Acceptance criteria

  • AC1: Thresholds for ratio_mm_failed are tuned based on concrete data
  • AC2: Investigation advice in alert contains concrete steps
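One way to approach AC1 is to derive the threshold from past values of ratio_mm_failed rather than picking it by hand. The following is only a minimal sketch under stated assumptions: the helper names, the percentile, the headroom factor, and the sample history are all hypothetical and are not taken from the actual alert configuration or from real monitoring data. Only the values 5.08, 5.28 and 5.91 appear in this ticket.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def suggest_threshold(samples, pct=95, headroom=1.2):
    """Suggest an alert threshold: a high percentile of past values plus headroom.

    pct and headroom are hypothetical tuning knobs, not existing settings.
    """
    return nearest_rank_percentile(samples, pct) * headroom


# Hypothetical history of ratio_mm_failed; only 5.08, 5.28 and 5.91 come
# from this ticket, the remaining values are made up for illustration.
history = [1.2, 2.0, 3.1, 2.4, 5.08, 5.28, 5.91, 1.8, 2.9, 3.3]
print(f"suggested threshold: {suggest_threshold(history):.2f}")
```

With real data exported from the monitoring database, the percentile and headroom could be adjusted until the alert fires only for periods that actually warranted investigation.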

Suggestion


Related issues

Related to openQA Project - action #95783: Provide support for multi-machine scenarios handled by openqa-investigate (In Progress)

Related to openQA Project - action #71809: Enable multi-machine jobs trigger without "isos post" (Workable, 2020-09-24)

Copied from openQA Project - action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M (Resolved, 2021-07-28, 2021-12-07)

History

#1 Updated by cdywan about 2 months ago

  • Copied from action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M added

#2 Updated by cdywan about 2 months ago

  • Description updated (diff)

#3 Updated by cdywan about 2 months ago

And alerting again right now:

ratio_mm_failed 5.910

#4 Updated by cdywan about 2 months ago

cdywan wrote:

And alerting again right now:

ratio_mm_failed 5.910

It's OK again

#5 Updated by cdywan about 2 months ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 to Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by okurz about 2 months ago

  • Due date deleted (2021-12-07)
  • Start date deleted (2021-07-28)

I think you copied the due-date from the clonee ticket hence removing it here and resetting start date as well. By the way, what do you want to say with the subject "Provide … alerting"?

#7 Updated by cdywan about 2 months ago

  • Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M to Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M

okurz wrote:

I think you copied the due-date from the clonee ticket hence removing it here and resetting start date as well. By the way, what do you want to say with the subject "Provide … alerting"?

Monitor "fail-ratio" of tests became Provide "fail-rate" of tests became this 😉️

#8 Updated by cdywan about 2 months ago

  • Priority changed from Normal to High

Alerting now:

ratio_mm_failed 5.080

I think it's safe to say we've reached alert fatigue. And we're not even clear what caused or resolved the previous occurrences. Hence raising prio.

#9 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

Paused the alert for now

#10 Updated by okurz about 2 months ago

  • Related to action #95783: Provide support for multi-machine scenarios handled by openqa-investigate added

#11 Updated by okurz about 2 months ago

  • Related to action #71809: Enable multi-machine jobs trigger without "isos post" added

#12 Updated by mkittler about 2 months ago

  • Status changed from Workable to Feedback

#13 Updated by okurz about 1 month ago

  • Parent task set to #103962

#14 Updated by okurz about 1 month ago

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/625 with some investigation hints in the alert notification message, from this ticket's description

#15 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/622 has been merged. It keeps the alert but uses a very high threshold instead of removing the alert completely. This is a bit different from AC1 but it seems to be the best we can currently do.

I've also just merged the SR from @okurz. This should cover AC2.
