action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M - openQA Project - openSUSE Project Management Tool

action #96191

closed

Provide "fail-rate" of tests, especially multi-machine, in grafana size:M

Added by okurz over 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Feature requests

Target version:

Ready

Start date:

2021-07-28

Due date:

2021-09-29

% Done:

Estimated time:

Description

Motivation¶

The hypothesis was raised that "multimachine jobs have decreased reliability since ~2 weeks (2 nodes). More nodes are even worse." Maybe true, maybe not. We should be able to calculate a fail-ratio for different categories of openQA tests, e.g. in grafana based on SQL queries. With this we would be able to support/reject the hypothesis.

Suggestion¶

See what grafana data we have, or SQL queries, extend as needed
Consider mm versus "normal" tests
Focus on failed jobs to start with; we already deal with incompletes
Exclude retried jobs since those don't run for mm

Related issues 4 (0 open — 4 closed)

Updated by okurz over 3 years ago

Related to coordination #96185: [epic] Multimachine failure rate increased added

Updated by okurz about 3 years ago

Target version changed from future to Ready

With #96260 done we can do this now

Updated by livdywan about 3 years ago

Subject changed from Monitor "fail-ratio" of tests, especially multi-machine tests, to have data backing (or disproofing) claims that "multi-machine tests become more unstable" to Provide "fail-rate" of tests, especially multi-machine, in grafana size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz about 3 years ago

Due date set to 2021-09-29
Status changed from Workable to In Progress
Assignee set to okurz

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/575 to get an initial fail-ratio into influxdb, which we can turn into a time-based fail-rate for all tests in grafana. For multi-machine tests I tried to select all of them with:

select jobs.id,state,result_dir,key,value,jobs.t_created from jobs left join job_settings on jobs.id = job_settings.job_id and key = 'PARALLEL_WITH' order by jobs.id desc limit 10;

This shows me many jobs but they don't seem to have the job setting "PARALLEL_WITH". Where am I going wrong?

EDIT: mkittler helped. I should look into "job_dependencies". Trying something like:

openqa=> select (select count(id) from jobs where result = 'failed' and t_created >= (NOW() - interval '24 hour') and ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null)) * 100. / (select count(id) from jobs where ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null));
        ?column?        
------------------------
 0.00299605519399457381
(1 row)

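which should be the percentage of failed multi-machine tests among all multi-machine tests. Note that only the numerator is limited to the last 24 hours while the denominator counts all multi-machine jobs, so the value understates the real ratio. The query also takes multiple seconds, so it is quite costly. I hope someone can help to optimize it.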
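
A minimal sketch of the same calculation on toy data, assuming a reduced schema with only the columns used above and SQLite instead of PostgreSQL so that it runs standalone. The table contents and the helper name `mm_failed_percentage` are invented for illustration. It applies one time window to both numerator and denominator and uses `exists` instead of a scalar subquery; whether that is actually faster on the production database is not verified here.

```python
import sqlite3

# Toy schema: only the columns the queries in this ticket use (assumption).
SCHEMA = """
create table jobs (id integer primary key, result text, t_created text);
create table job_dependencies (child_job_id integer, parent_job_id integer, dependency integer);
"""

# dependency = 2 is the value the query above uses to select multi-machine jobs.
MM_FAILED_PERCENTAGE = """
with mm_jobs as (
    select id, result from jobs
    where t_created >= :since
      and exists (select 1 from job_dependencies d
                  where d.dependency = 2
                    and (d.child_job_id = jobs.id or d.parent_job_id = jobs.id))
)
select sum(result = 'failed') * 100.0 / count(*) from mm_jobs
"""


def mm_failed_percentage(conn, since):
    """Percentage of failed multi-machine jobs created at or after `since`."""
    return conn.execute(MM_FAILED_PERCENTAGE, {"since": since}).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("insert into jobs values (?, ?, ?)", [
    (1, "passed", "2021-09-20 10:00:00"),
    (2, "failed", "2021-09-20 10:00:00"),
    (3, "passed", "2021-09-20 11:00:00"),
    (4, "passed", "2021-09-20 11:00:00"),
    (5, "failed", "2021-09-20 12:00:00"),  # single-machine job, not counted
    (6, "failed", "2021-09-01 12:00:00"),  # multi-machine, but outside the window
    (7, "passed", "2021-09-01 12:00:00"),
])
conn.executemany("insert into job_dependencies values (?, ?, ?)", [
    (2, 1, 2),  # jobs 1 and 2 run in parallel
    (4, 3, 2),  # jobs 3 and 4 run in parallel
    (7, 6, 2),  # older parallel pair
])

# Jobs 1-4 are in the window, one of them failed.
print(mm_failed_percentage(conn, "2021-09-20 00:00:00"))  # prints 25.0
```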

Updated by okurz about 3 years ago

Related to action #98604: Provide data about ratio of automatically approved SLE Maintenance incidents size:M added

Updated by okurz about 3 years ago

Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/578 adds a monitoring panel with an alert for the "fail-ratio" of generic tests. For multi-machine tests IMHO we need a better query than the one above to identify the multi-machine fail-ratio. Who has an idea for a faster, more optimized SQL query?

Updated by okurz about 3 years ago

What mkittler, nsinger and I found is:

openqa=> \timing
Timing is on.
openqa=> with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
       result       | ratio 
--------------------+-------
 obsoleted          | 44.56
 passed             | 34.02
 skipped            |  6.54
 parallel_failed    |  5.65
 parallel_restarted |  3.01
 failed             |  2.28
 softfailed         |  2.18
 incomplete         |  1.41
 none               |  0.16
 user_cancelled     |   0.1
 timeout_exceeded   |  0.09
 user_restarted     |  0.01
(12 rows)

Time: 3866.327 ms (00:03.866)

which is nice.
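
A small sketch of the grouping query on invented toy data, assuming SQLite rather than PostgreSQL so that it runs standalone (hence no `::numeric(5,2)::float` cast) and a reduced schema with only the columns the query touches. It illustrates why `distinct` matters: a job with several parallel children matches several `job_dependencies` rows and would otherwise be counted more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table jobs (id integer primary key, result text);
create table job_dependencies (child_job_id integer, parent_job_id integer, dependency integer);
""")
conn.executemany("insert into jobs values (?, ?)", [
    (1, "passed"), (2, "passed"), (3, "passed"),  # one 3-node cluster
    (4, "failed"), (5, "parallel_failed"),        # one 2-node cluster
    (6, "failed"),                                # single-machine job, ignored
])
conn.executemany("insert into job_dependencies values (?, ?, ?)", [
    (2, 1, 2), (3, 1, 2),  # job 1 is the parent of two parallel children
    (5, 4, 2),
])

# Same shape as the query above. An inner join is used here because the
# "where dependency = 2" filter discards unmatched rows of a left join anyway.
RATIO_BY_RESULT = """
with mm_jobs as (
    select distinct id, result from jobs
    join job_dependencies on (id = child_job_id or id = parent_job_id)
    where dependency = 2
)
select result, round(count(id) * 100.0 / (select count(id) from mm_jobs), 2) as ratio
from mm_jobs group by result order by ratio desc, result
"""

rows = conn.execute(RATIO_BY_RESULT).fetchall()
for result, ratio in rows:
    print(f"{result:16} {ratio:6.2f}")
```

Without `distinct`, job 1 would appear twice in `mm_jobs` and inflate the share of passed jobs.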

We can filter by a shorter time window to speed this up, e.g.:

with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where t_created >= (select now() at time zone 'utc' - interval '1 hour') and dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
 result  | ratio 
---------+-------
 none    | 99.14
 skipped |  0.86
(2 rows)

Keep in mind that we already used select now() in other cases, which causes wrong results in PostgreSQL. We should fix those cases to use select now() at time zone 'utc'.
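
One plausible explanation, stated here as an assumption rather than something confirmed in this ticket: if t_created holds UTC values in a column without time zone, comparing it against a local-time "now" shifts the window by the server's UTC offset. The sketch below uses fixed, made-up timestamps and a hypothetical UTC+2 server time zone to show the effect outside the database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical server time zone (UTC+2) and a fixed "now" for reproducibility.
local_tz = timezone(timedelta(hours=2))
now = datetime(2021, 9, 20, 12, 0, tzinfo=timezone.utc)

# Naive timestamps as they would be stored if the column holds UTC values.
created_utc = [
    datetime(2021, 9, 20, 11, 30),  # 30 minutes ago
    datetime(2021, 9, 20, 10, 30),  # 90 minutes ago
]


def recent(created, naive_now, window=timedelta(hours=1)):
    """Return the timestamps that fall within `window` before `naive_now`."""
    return [t for t in created if t >= naive_now - window]


# Analogous to "now() at time zone 'utc'": convert to UTC, then drop the zone.
naive_now_utc = now.astimezone(timezone.utc).replace(tzinfo=None)
# Analogous to mixing in local time: "now" rendered in the server's zone.
naive_now_local = now.astimezone(local_tz).replace(tzinfo=None)

print(len(recent(created_utc, naive_now_utc)))    # prints 1
print(len(recent(created_utc, naive_now_local)))  # prints 0
```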

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/583 with fixes and new panels and alerts.

Next steps or ideas for the future:

Make colors a bit more consistent with the other panels
Add additional long-term queries for long-term statistics, e.g. all jobs, not limited to a time window, and include them in high-level views, e.g. on the index page of grafana -> Understand general team performance
Group by workers (and job groups and machines and archs?) with corresponding alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)

Updated by okurz about 3 years ago

okurz wrote:

Add additional long-term queries for long-term statistics, e.g. all jobs, not limited to a time window, and include them in high-level views, e.g. on the index page of grafana -> Understand general team performance

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/585 to collect the data. Once accepted, the data needs to be included in grafana.

Group by workers (and job groups and machines and archs?) with corresponding alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)

I wonder if it's a good idea to try a single query including everything, e.g.

select result, job_groups.name, machine, host, round(count(jobs.id) * 100. / (select count(jobs.id) from jobs), 2)::numeric(5,2)::float as ratio_all_long_term from jobs left join job_groups on jobs.group_id = job_groups.id left join workers on jobs.assigned_worker_id = workers.id group by result, job_groups.name, machine, host;

This takes roughly 5s on osd and returns 14k rows. I don't know if we can sum these up afterwards in grafana. If not, we would have 14k different values which all round to 0.00 or something. I guess separate queries are required here?
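
On the question of summing up afterwards, a rough sketch outside of grafana, assuming the query exported raw counts per group instead of rounded percentages. The rows and the helper name `ratio_by` are invented. Counts can be summed along any dimension and divided once at the end, whereas percentages rounded to two decimals per group lose the small contributions. Whether grafana can do this aggregation itself is not answered here.

```python
from collections import defaultdict

# Invented rows: (result, job group, machine, host, count).
rows = [
    ("failed", "Group A", "64bit", "worker1", 3),
    ("failed", "Group B", "64bit", "worker2", 1),
    ("passed", "Group A", "64bit", "worker1", 10),
    ("passed", "Group B", "aarch64", "worker3", 6),
]


def ratio_by(rows, column):
    """Sum counts over all other columns, return the percentage per value of `column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[-1]
    grand_total = sum(totals.values())
    return {key: round(count * 100 / grand_total, 2) for key, count in totals.items()}


print(ratio_by(rows, 0))  # share per result
print(ratio_by(rows, 3))  # share per worker host
```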

Updated by okurz about 3 years ago

For the long-term statistics, which I configured to run only every 24h, there is no data yet. Maybe the first run is only triggered after 24h, not right at service start. Waiting …

#10

Updated by okurz about 3 years ago

Running out of patience, I manually changed the interval on osd from 24h to 2m, collected some samples, created grafana panels and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/587

#11

Updated by okurz about 3 years ago

Status changed from Feedback to Resolved

Graphs are live on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-15m&to=now

Some current results:

11% of all tests fail vs. 40%+8%=48% passed+softfailed so at least 4x more "ok" results compared to failed
25% of all tests are obsoleted. I consider this quite high; it shows that our processes focus more on failed tests than on actually finishing tests or ensuring test coverage
2% of all multi-machine tests fail vs. 34%+2%=36% passed+softfailed so 18x more "ok" results compared to failed
44% of all multi-machine tests are obsoleted which I consider kinda crazy

But our results could be skewed if, for example, we store obsoleted tests longer than failed or passed ones. In general we already know that failed tests are much more often considered "important" and hence kept around longer, so in reality the ratio of passed vs. failed is likely much higher.

Especially for QA maintenance tests we cannot preserve a long history, so we cannot currently say whether the multi-machine failure ratio increased, but I expect that we are better set up for the future now.
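
For reference, the "ok" to failed factors quoted in the results list can be recomputed from the rounded percentages given there. The helper name `ok_to_failed` is made up for this sketch.

```python
def ok_to_failed(passed, softfailed, failed):
    """Ratio of "ok" results (passed + softfailed) to failed results."""
    return (passed + softfailed) / failed


print(round(ok_to_failed(40, 8, 11), 1))  # all tests: prints 4.4
print(round(ok_to_failed(34, 2, 2), 1))   # multi-machine tests: prints 18.0
```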

#12

Updated by okurz about 3 years ago

Copied to action #99135: Provide ratio of tests by result in monitoring - by worker added

#13

Updated by livdywan almost 3 years ago

Copied to action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M added
