action #162323: no alert about multi-machine test failures 2024-06-14 size:S - openQA Project (public) - openSUSE Project Management Tool

action #162323

closed

Added by okurz 10 months ago. Updated 7 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2024-06-15

Due date:

% Done:

Estimated time:

Tags:

multi-machine, alert, monitoring, infra

Description

Observation¶

https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885 shows a significantly (too) high number of failed+parallel_failed jobs, but no alert was triggered. We should make our alerts trigger in such situations.

Suggestion¶

- Check why our existing alerts did not work
  - https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1718145766820&to=1718474617885&editPanel=24&tab=alert We have 2 alerts linked to that panel
- Confirm what those failures were (if possible)
- https://monitor.qa.suse.de/alerting/list?search=incomplete

Related issues 1 (0 open — 1 closed)

Updated by okurz 10 months ago

Copied from action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added

Updated by okurz 10 months ago

Status changed from New to Rejected
Assignee set to okurz
Priority changed from High to Normal

Never mind. I guess with the failing https://gitlab.suse.de/openqa/scripts-ci/-/pipelines we are still informed well enough?

Updated by okurz 10 months ago

Tags set to infra, monitoring, alert, multi-machine
Status changed from Rejected to New
Assignee deleted (~~okurz~~)

No, scripts-ci tests cannot uncover all problems as they might not run certain worker combinations. I think an alert in Grafana is helpful and would be workable for us.

Updated by okurz 10 months ago

Target version changed from Ready to Tools - Next

Updated by livdywan 9 months ago

Subject changed from no alert about multi-machine test failures 2024-06-14+ to no alert about multi-machine test failures 2024-06-14 size:S
Description updated (diff)
Status changed from New to Workable

Updated by tinita 8 months ago

Target version changed from Tools - Next to Ready

Updated by mkittler 7 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Updated by mkittler 7 months ago · Edited

The ratio of failed MM tests was 26 % and our alert threshold is 30 %, so the alert did not fire. I could lower the threshold to e.g. 20 %. Otherwise I don't think there's anything wrong with the queries we use for alerting; they are in line with the panel queries.
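To make the threshold comparison concrete, here is a minimal sketch. It assumes the alert fires only when the ratio is strictly above the threshold; the actual Grafana rule may use a different comparison or evaluation window. The function name `exceeds_threshold` is hypothetical and not part of any openQA or Grafana API.

```python
def exceeds_threshold(failed_ratio_percent: float, threshold_percent: float) -> bool:
    """Return True if the failed-job ratio is strictly above the alert threshold.

    Assumption: a strict "greater than" comparison; the real alert rule may differ.
    """
    return failed_ratio_percent > threshold_percent


observed = 26.0  # ratio of failed MM tests reported in this comment

# Compare the current threshold (30 %) with the proposed one (20 %).
for threshold in (30.0, 20.0):
    state = "alert" if exceeds_threshold(observed, threshold) else "no alert"
    print(f"threshold {threshold:.0f} %: {state}")
```

Under that assumption, the observed 26 % stays below the current 30 % threshold but would be caught by a 20 % threshold.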

Note that there might be some confusion, though: On the graph linked in the ticket description the ratio of failed MM jobs looks much higher when also considering jobs with the result parallel_failed. It doesn't make much sense to consider those jobs, but it might be something one accidentally does because parallel_failed is also shown in red, only in a slightly different shade than failed. If the ticket was only created due to this confusion we might not want to change the alert threshold but instead use a different color for parallel_failed in the panel (e.g. some gray).
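The effect of reading both red series together can be sketched as follows. The job counts are purely hypothetical, chosen only so that the failed-only ratio matches the 26 % mentioned above; they are not taken from the dashboard. The helper `result_ratio` is likewise an illustration, not the query used for alerting.

```python
from collections import Counter


def result_ratio(counts: Counter, results: list[str]) -> float:
    """Percentage of jobs whose result is one of `results`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(counts[r] for r in results) / total


# Hypothetical counts for illustration only (not real data).
counts = Counter(passed=60, failed=26, parallel_failed=14)

failed_only = result_ratio(counts, ["failed"])
with_parallel = result_ratio(counts, ["failed", "parallel_failed"])

print(f"failed only:              {failed_only:.0f} %")
print(f"failed + parallel_failed: {with_parallel:.0f} %")
```

With these made-up numbers the combined share sits well above 30 % while the failed-only share does not, which is the kind of misreading described above.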
