action #103425
coordination #103962 (closed): [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens
Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M
Description
Observation
The Ratio of multi-machine tests was alerting between 10.34 and 15.10(?) CET today:
ratio_mm_failed 5.280
Acceptance criteria
- AC1: Thresholds for ratio_mm_failed are tuned based on concrete data
- AC2: Investigation advice in alert contains concrete steps
Suggestion
- Investigate what caused the ratio to turn
- Check https://openqa.suse.de/tests?resultfilter=Failed and look for a correlation
- Follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation
- Document how to better investigate this alert
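The correlation check suggested above could be sketched roughly as follows. This is only an illustration, assuming ratio_mm_failed is the percentage of failed jobs among the multi-machine jobs considered, which this ticket does not spell out. The helpers `failed_ratio` and `failures_by`, the sample data and the field names `test`, `worker` and `result` are made up for the example and are not taken from the actual alert query or from openQA.

```python
from collections import Counter

# Hypothetical sample of multi-machine job results; the field names
# and values are assumptions made for this sketch only.
SAMPLE_JOBS = [
    {"test": "mm_a", "worker": "worker1", "result": "passed"},
    {"test": "mm_b", "worker": "worker2", "result": "failed"},
    {"test": "mm_c", "worker": "worker2", "result": "failed"},
    {"test": "mm_d", "worker": "worker3", "result": "passed"},
]


def failed_ratio(jobs):
    """Return the percentage of jobs whose result is 'failed'."""
    if not jobs:
        return 0.0
    failed = sum(1 for job in jobs if job["result"] == "failed")
    return 100.0 * failed / len(jobs)


def failures_by(jobs, key):
    """Count failed jobs grouped by one field, most frequent first."""
    counts = Counter(job[key] for job in jobs if job["result"] == "failed")
    return counts.most_common()


if __name__ == "__main__":
    print("failed ratio in percent:", failed_ratio(SAMPLE_JOBS))
    print("failures per worker:", failures_by(SAMPLE_JOBS, "worker"))
```

If failures pile up on a single worker or test name, that is a hint for where to look first. An even spread would rather point to a product or infrastructure-wide cause.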
Updated by livdywan almost 3 years ago
- Copied from action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M added
Updated by livdywan almost 3 years ago
And alerting again right now:
ratio_mm_failed 5.910
Updated by livdywan almost 3 years ago
cdywan wrote:
And alerting again right now:
ratio_mm_failed 5.910
It's OK again
Updated by livdywan almost 3 years ago
- Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 to Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 3 years ago
- Due date deleted (2021-12-07)
- Start date deleted (2021-07-28)
I think you copied the due-date from the clonee ticket hence removing it here and resetting start date as well. By the way, what do you want to say with the subject "Provide … alerting"?
Updated by livdywan almost 3 years ago
- Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M to Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M
okurz wrote:
I think you copied the due-date from the clonee ticket hence removing it here and resetting start date as well. By the way, what do you want to say with the subject "Provide … alerting"?
Monitor "fail-ratio" of tests
became Provide "fail-rate" of tests
became this 😉️
Updated by livdywan almost 3 years ago
- Priority changed from Normal to High
Alerting now:
ratio_mm_failed 5.080
I think it's safe to say we've reached alert fatigue. And we're not even clear what caused or resolved the previous occurrences. Hence raising prio.
Updated by okurz almost 3 years ago
- Related to action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M added
Updated by okurz almost 3 years ago
- Related to action #71809: Enable multi-machine jobs trigger without "isos post" added
Updated by mkittler almost 3 years ago
- Status changed from Workable to Feedback
SR to remove it completely: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/621
Updated by okurz almost 3 years ago
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/625 with some investigation hints in the alert notification message, from this ticket's description
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/622 has been merged (to not remove the alert completely and just use a very high threshold). This is a bit different than AC1 but it seems to be the best we can currently do.
I've also just merged the SR from @okurz. This should cover AC2.