coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens
Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M
The Ratio of multi-machine tests was alerting between 10.34 and 15.10(?) CET today:
- AC1: Thresholds for ratio_mm_failed are tuned based on concrete data
- AC2: Investigation advice in alert contains concrete steps
- Investigate what caused the ratio to turn
- Check https://openqa.suse.de/tests?resultfilter=Failed and look for a correlation
- Follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation
- Document how to better investigate this alert
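As a rough illustration of the "look for a correlation" step above, here is a minimal sketch. It assumes the job results have already been exported into a list of dictionaries. The field names (`is_mm`, `result`, `worker`), the helper names and the sample data are all hypothetical, and treating the ratio as a percentage is an assumption too; none of this is taken from the actual monitoring query or the openQA API.

```python
from collections import Counter


def mm_failed_ratio(jobs):
    """Return the percentage of failed jobs among multi-machine jobs.

    Assumption: each job is a dict with a boolean "is_mm" and a string
    "result"; this is a hypothetical layout for illustration only.
    """
    mm_jobs = [job for job in jobs if job.get("is_mm")]
    if not mm_jobs:
        return 0.0
    failed = sum(1 for job in mm_jobs if job["result"] == "failed")
    return 100.0 * failed / len(mm_jobs)


def mm_failures_by(jobs, key):
    """Count failed multi-machine jobs grouped by the given field."""
    return Counter(
        job[key]
        for job in jobs
        if job.get("is_mm") and job["result"] == "failed"
    )


# Made-up sample data, not real job results.
sample_jobs = [
    {"id": 1, "is_mm": True, "result": "failed", "worker": "worker-a"},
    {"id": 2, "is_mm": True, "result": "passed", "worker": "worker-b"},
    {"id": 3, "is_mm": True, "result": "failed", "worker": "worker-a"},
    {"id": 4, "is_mm": False, "result": "failed", "worker": "worker-c"},
    {"id": 5, "is_mm": True, "result": "passed", "worker": "worker-c"},
]

ratio = mm_failed_ratio(sample_jobs)
by_worker = mm_failures_by(sample_jobs, "worker")
print(f"ratio of failed multi-machine jobs: {ratio:.3f}")
print("failures per worker:", by_worker.most_common())
```

If the failures cluster on a single worker or a single test scenario, that points to an infrastructure or product issue rather than a general increase in the fail ratio.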
#7 Updated by cdywan about 2 months ago
- Subject changed from Provide "fail-rate" alerting with ratio_mm_failed 5.280 size:M to Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M
I think you copied the due date from the clonee ticket, hence I'm removing it here and resetting the start date as well. By the way, what do you want to say with the subject "Provide … alerting"?
Monitor "fail-ratio" of tests became
Provide "fail-rate" of tests became this 😉️
#12 Updated by mkittler about 2 months ago
- Status changed from Workable to Feedback
SR to remove it completely: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/621
#14 Updated by okurz about 1 month ago
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/625 with some investigation hints in the alert notification message, from this ticket's description
#15 Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/622 has been merged (to not remove the alert completely and just use a very high threshold). This is a bit different than AC1 but it seems to be the best we can currently do.
I've also just merged the SR from @okurz. This should cover AC2.