action #162323 (closed)

no alert about multi-machine test failures 2024-06-14 size:S

Added by okurz 6 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Regressions/Crashes
Target version: Ready
Start date: 2024-06-15
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885 shows a significantly (too) high number of failed+parallel_failed jobs, but no alert was triggered. We should make our alerts trigger in such situations.
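
A minimal sketch of the intended alert condition, not the actual Grafana alert rule; the function names, the example counts, and whether parallel_failed jobs are counted are assumptions for illustration:

```python
# Hypothetical sketch of the alert condition requested here. The job counts
# per result would come from the monitoring data source behind
# monitor.qa.suse.de; here they are plain integers.

def mm_failure_ratio(failed: int, parallel_failed: int, total: int,
                     count_parallel_failed: bool = True) -> float:
    """Ratio of failed multi-machine jobs over all finished MM jobs."""
    if total == 0:
        return 0.0
    bad = failed + (parallel_failed if count_parallel_failed else 0)
    return bad / total

# Threshold is illustrative; comment #8 below mentions 30 % as the real value.
ALERT_THRESHOLD = 0.30

def should_alert(failed: int, parallel_failed: int, total: int) -> bool:
    return mm_failure_ratio(failed, parallel_failed, total) > ALERT_THRESHOLD

# With numbers in the ballpark of comment #8 (26 % plain failures), the
# alert only fires if parallel_failed jobs are counted as well:
assert not should_alert(failed=26, parallel_failed=0, total=100)
assert should_alert(failed=26, parallel_failed=20, total=100)
```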

Suggestion


Related issues: 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry (Resolved, okurz, 2024-06-15)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added
Actions #2

Updated by okurz 6 months ago

  • Status changed from New to Rejected
  • Assignee set to okurz
  • Priority changed from High to Normal

Never mind. I guess with the failing pipelines at https://gitlab.suse.de/openqa/scripts-ci/-/pipelines we are still informed enough?

Actions #3

Updated by okurz 6 months ago

  • Tags set to infra, monitoring, alert, multi-machine
  • Status changed from Rejected to New
  • Assignee deleted (okurz)

No, the scripts-ci tests cannot uncover all problems as they might not run certain worker combinations. I think an alert in Grafana would be helpful and workable for us.

Actions #4

Updated by okurz 6 months ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by livdywan 4 months ago

  • Subject changed from no alert about multi-machine test failures 2024-06-14+ to no alert about multi-machine test failures 2024-06-14 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by tinita 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #8

Updated by mkittler 3 months ago · Edited

The ratio of failed MM tests was 26 % and our alert threshold is 30 %. I could lower the threshold to e.g. 20 %. Otherwise I don't think there is anything wrong with the queries we use for alerting; they are in line with the panel queries.

Note that there might be some confusion, though: on the graph linked in the ticket description the ratio of failed MM jobs looks much higher when jobs with the result parallel_failed are also considered. It doesn't make much sense to count those jobs, but one might do so accidentally because parallel_failed is shown in red as well, in only a slightly different shade than failed. If the ticket was only created due to this confusion, we might not want to change the alert threshold but instead use a different color for parallel_failed in the panel (e.g. some gray).
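
A minimal sketch of such a color change, assuming the panel uses Grafana's field-config overrides (the documented fieldConfig schema); the panel title and the stub dashboard are assumptions, not how the actual dashboard is provisioned:

```python
import json

# Grafana field-config override giving the parallel_failed series a fixed
# gray color so it is not confused with failed (shown in red).
OVERRIDE = {
    "matcher": {"id": "byName", "options": "parallel_failed"},
    "properties": [
        {"id": "color", "value": {"mode": "fixed", "fixedColor": "gray"}},
    ],
}

def add_parallel_failed_override(dashboard: dict, panel_title: str) -> None:
    """Attach the override to the panel with the given title (in place)."""
    for panel in dashboard.get("panels", []):
        if panel.get("title") == panel_title:
            overrides = (panel.setdefault("fieldConfig", {})
                              .setdefault("overrides", []))
            overrides.append(OVERRIDE)

# Demo on a stub dashboard; the panel title is an assumption:
dashboard = {"panels": [{"title": "Job results", "fieldConfig": {}}]}
add_parallel_failed_override(dashboard, "Job results")
print(json.dumps(dashboard, indent=2))
```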

Actions #9

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by okurz 3 months ago

Yes, I think we should improve the colors. Give it a try.

Actions #12

Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved