action #162323
no alert about multi-machine test failures 2024-06-14 size:S (closed)
Description
Observation
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885 shows a significantly (too) high number of failed+parallel_failed jobs, but no alert was triggered. We should make our alerts trigger in such situations.
Suggestion
- Check why our existing alerts did not work (see the sketch after this list for the kind of ratio check assumed here)
  - We have 2 alerts linked to that panel: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885&editPanel=24&tab=alert
- Confirm what those failures were (if possible)
  - https://monitor.qa.suse.de/alerting/list?search=incomplete
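For orientation, here is a minimal sketch of the kind of check such an alert presumably performs: compute the share of failed multi-machine jobs and compare it against a fixed threshold. The function names, the job-count data structure and the handling of parallel_failed are illustrative assumptions, not the actual Grafana/InfluxDB alert query.

```python
# Minimal sketch of the ratio check the alerts are assumed to perform.
# Result names, the data structure and the threshold handling are
# illustrative assumptions, not the actual Grafana/InfluxDB alert query.

def mm_failure_ratio(job_counts: dict[str, int],
                     include_parallel_failed: bool = False) -> float:
    """Share of failed multi-machine jobs in percent."""
    total = sum(job_counts.values())
    if total == 0:
        return 0.0
    failed = job_counts.get("failed", 0)
    if include_parallel_failed:
        # parallel_failed jobs only failed because a parallel sibling failed,
        # so counting them inflates the ratio (see the later comment below).
        failed += job_counts.get("parallel_failed", 0)
    return 100.0 * failed / total


def should_alert(ratio_percent: float, threshold_percent: float = 30.0) -> bool:
    """Fire only when the failure ratio exceeds the configured threshold."""
    return ratio_percent > threshold_percent
```

Whether parallel_failed should be counted at all is exactly the point discussed further down in this ticket.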
Updated by okurz 4 months ago
- Copied from action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added
Updated by okurz 4 months ago
- Status changed from New to Rejected
- Assignee set to okurz
- Priority changed from High to Normal
nevermind. I guess with the failing https://gitlab.suse.de/openqa/scripts-ci/-/pipelines we are still informed enough?
Updated by okurz 4 months ago
- Tags set to infra, monitoring, alert, multi-machine
- Status changed from Rejected to New
- Assignee deleted (okurz)
No, scripts-ci tests cannot uncover all problems as they might not run certain worker combinations. I think an alert in Grafana is helpful and would be workable for us.
Updated by tinita about 1 month ago
- Target version changed from Tools - Next to Ready
Updated by mkittler 26 days ago · Edited
The ratio of failed MM tests was 26 % and our alert threshold is 30 %. I could lower the threshold to e.g. 20 %. Otherwise I don't think there's anything wrong with the queries we use for alerting and they are in line with the panel queries.
Note that there might be some confusion, though: On the graph linked in the ticket description the ratio of failed MM jobs looks much higher when also considering jobs with the result parallel_failed. It doesn't make much sense to consider those jobs, but it might be something one accidentally does because parallel_failed is also shown in red and only in a slightly different shade of red than failed. If the ticket was only created due to this confusion we might not want to change the alert threshold but instead use a different color for parallel_failed in the panel (e.g. some gray).
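To make the arithmetic concrete, here is a hedged example with hypothetical job counts; only the 26 % ratio and the 30 %/20 % thresholds come from the comment above, the absolute numbers are made up.

```python
# Hypothetical counts: only the 26 % ratio and the 30 %/20 % thresholds
# come from the comment above, the absolute job numbers are invented.
failed, total = 26, 100
ratio = 100.0 * failed / total   # 26.0 %
print(ratio > 30.0)              # False -> current threshold, no alert fired
print(ratio > 20.0)              # True  -> proposed lower threshold would fire
```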
Updated by okurz 26 days ago
- Status changed from Feedback to Resolved
Merged. This should suffice. Verified on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885 . Thx