action #96191

Provide "fail-rate" of tests, especially multi-machine, in grafana size:M

Added by okurz 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-07-28
Due date: 2021-09-29
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

The hypothesis was raised that "multimachine jobs have decreased reliability since ~2 weeks (2 nodes). More nodes are even worse." Maybe true, maybe not. We should be able to calculate a fail-ratio for different categories of openQA tests, e.g. in grafana based on SQL queries. With this we would be able to support/reject the hypothesis.

Suggestion

  • See what grafana data we have, or SQL queries, extend as needed
  • Consider mm versus "normal" tests
  • Focus on failed jobs to start with; we already deal with incompletes
  • Exclude retried jobs since those don't run for mm

Related issues

Related to openQA Project - coordination #96185: [epic] Multimachine failure rate increased (Blocked, 2021-07-29 to 2021-10-09)

Related to openQA Project - action #98604: Provide data about ratio of automatically approved SLE Maintenance incidents size:M (Resolved, 2021-09-14 to 2021-10-12)

Copied to openQA Project - action #99135: Provide ratio of tests by result in monitoring - by worker (Resolved, 2021-10-09)

History

#1 Updated by okurz 3 months ago

#2 Updated by okurz 2 months ago

  • Target version changed from future to Ready

With #96260 done we can do this now

#3 Updated by cdywan about 2 months ago

  • Subject changed from Monitor "fail-ratio" of tests, especially multi-machine tests, to have data backing (or disproofing) claims that "multi-machine tests become more unstable" to Provide "fail-rate" of tests, especially multi-machine, in grafana size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by okurz about 1 month ago

  • Due date set to 2021-09-29
  • Status changed from Workable to In Progress
  • Assignee set to okurz

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/575 to get an initial fail-ratio into influxdb, from which we can calculate a time-based fail-rate for all tests in grafana. For multi-machine tests I tried to select all multi-machine jobs with:

select jobs.id,state,result_dir,key,value,jobs.t_created from jobs left join job_settings on jobs.id = job_settings.job_id and key = 'PARALLEL_WITH' order by jobs.id desc limit 10;

This shows me many jobs, but none of them seem to have the job setting "PARALLEL_WITH". Where am I going wrong?

EDIT: mkittler helped. I should look into "job_dependencies". Trying something like:

openqa=> select (select count(id) from jobs where result = 'failed' and t_created >= (NOW() - interval '24 hour') and ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null)) * 100. / (select count(id) from jobs where ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null));
        ?column?        
------------------------
 0.00299605519399457381
(1 row)

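which should be the percentage of failed multi-machine tests among all multi-machine tests. Note that only the numerator is restricted to the last 24 hours while the denominator counts all multi-machine jobs, which explains the very small value. The query also takes multiple seconds, so it is quite costly. I hope someone can help to optimize it.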
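A minimal sketch of one possible restructuring, not the query that ended up in production: collect the multi-machine jobs once in a CTE, apply the same 24h window to numerator and denominator, and use EXISTS instead of two scalar subqueries. It runs against an in-memory SQLite database with made-up toy rows so it can be tried without access to the openQA database. The reduced table layout and the `mm_fail_ratio` helper are assumptions for illustration, and the time functions would need the PostgreSQL equivalents on a real instance.

```python
import sqlite3

# Toy schema: only the columns used by the query (assumed, heavily reduced).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, t_created TEXT);
    CREATE TABLE job_dependencies (
        child_job_id INTEGER, parent_job_id INTEGER, dependency INTEGER);
""")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, datetime('now', ?))",
    [
        (1, "failed", "-1 hours"),   # multi-machine, in window
        (2, "passed", "-1 hours"),   # multi-machine, in window
        (3, "passed", "-2 hours"),   # multi-machine, in window
        (4, "passed", "-3 hours"),   # multi-machine, in window
        (5, "failed", "-1 hours"),   # not multi-machine
        (6, "failed", "-48 hours"),  # multi-machine, outside window
        (7, "passed", "-48 hours"),  # multi-machine, outside window
    ],
)
conn.executemany(
    "INSERT INTO job_dependencies VALUES (?, ?, ?)",
    # dependency = 2 marks parallel jobs, as in the queries in this ticket
    [(1, 2, 2), (3, 4, 2), (6, 7, 2), (5, 2, 1)],
)

QUERY = """
WITH mm_jobs AS (
    SELECT id, result FROM jobs
    WHERE t_created >= datetime('now', '-24 hours')
      AND EXISTS (
          SELECT 1 FROM job_dependencies d
          WHERE d.dependency = 2
            AND (d.child_job_id = jobs.id OR d.parent_job_id = jobs.id))
)
SELECT 100.0 * SUM(CASE WHEN result = 'failed' THEN 1 ELSE 0 END) / COUNT(*)
FROM mm_jobs
"""


def mm_fail_ratio(connection):
    """Percentage of failed jobs among multi-machine jobs of the last 24h."""
    return connection.execute(QUERY).fetchone()[0]


print(mm_fail_ratio(conn))  # 1 failed out of 4 multi-machine jobs in window
```

Whether this is actually faster on the real database depends on the available indexes, so it would need to be checked with \timing.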

#5 Updated by okurz about 1 month ago

  • Related to action #98604: Provide data about ratio of automatically approved SLE Maintenance incidents size:M added

#6 Updated by okurz about 1 month ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/578 for a monitoring panel with alert for "fail-ratio" of generic tests. For multi-machine tests IMHO we need to find a better query than the above to identify the multi-machine fail-ratio. Who has an idea how we can have a faster, more optimized SQL query?

#7 Updated by okurz about 1 month ago

What mkittler, nsinger and I found is:

openqa=> \timing
Timing is on.
openqa=> with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
       result       | ratio 
--------------------+-------
 obsoleted          | 44.56
 passed             | 34.02
 skipped            |  6.54
 parallel_failed    |  5.65
 parallel_restarted |  3.01
 failed             |  2.28
 softfailed         |  2.18
 incomplete         |  1.41
 none               |  0.16
 user_cancelled     |   0.1
 timeout_exceeded   |  0.09
 user_restarted     |  0.01
(12 rows)

Time: 3866.327 ms (00:03.866)

which is nice.

We can filter by a shorter time window to speed up the query, e.g.

with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where t_created >= (select now() at time zone 'utc' - interval '1 hour') and dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
 result  | ratio 
---------+-------
 none    | 99.14
 skipped |  0.86
(2 rows)

Keep in mind that we already used a plain "select now()" in other queries, which gives wrong results in PostgreSQL for such time-window filters. We should change those to "select now() at time zone 'utc'".
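To illustrate why the time zone matters, here is a small sketch in plain Python rather than SQL. It assumes that t_created holds UTC wall-clock values without zone information, and it uses a hypothetical server zone of UTC+2. The `in_last_hour` helper is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone of the database server: UTC+2.
LOCAL = timezone(timedelta(hours=2))


def in_last_hour(t_created, now):
    """True if the naive timestamp lies within the hour before `now`."""
    return now - timedelta(hours=1) <= t_created <= now


# One fixed instant, expressed as naive wall-clock time in both zones.
instant = datetime(2021, 9, 20, 12, 0, tzinfo=timezone.utc)
now_utc = instant.replace(tzinfo=None)                      # 12:00
now_local = instant.astimezone(LOCAL).replace(tzinfo=None)  # 14:00

# A job created 30 minutes before that instant, stored as naive UTC.
job_created = datetime(2021, 9, 20, 11, 30)

print(in_last_hour(job_created, now_utc))    # window 11:00-12:00 matches
print(in_last_hour(job_created, now_local))  # window 13:00-14:00 misses it
```

With the local clock the one-hour window is shifted by the UTC offset, so recent jobs fall outside it.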

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/583 with fixes and new panels and alerts.

Next steps or ideas for the future:

  • Make colors a bit more consistent with the other panels
  • Add additional long-time queries for long-time statistics, e.g. all jobs, not limited to time-window and include in high-level views, e.g. on index page of grafana -> Understand general team performances
  • Group by workers (and job groups and machines and archs?) and according alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)

#8 Updated by okurz about 1 month ago

okurz wrote:

  • Add additional long-time queries for long-time statistics, e.g. all jobs, not limited to time-window and include in high-level views, e.g. on index page of grafana -> Understand general team performances

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/585 to collect the data. After acceptance, this needs to be included in grafana.

  • Group by workers (and job groups and machines and archs?) and according alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)

I wonder if it's a good idea to try a single query that includes everything, e.g.

select result, job_groups.name, machine, host, round(count(jobs.id) * 100. / (select count(jobs.id) from jobs), 2)::numeric(5,2)::float as ratio_all_long_term from jobs left join job_groups on jobs.group_id = job_groups.id left join workers on jobs.assigned_worker_id = workers.id group by result, job_groups.name, machine, host;

This takes roughly 5s on osd and returns 14k rows. I don't know if we can sum the rows up afterwards in grafana. If not, we would have 14k different values which all round to 0.00 or something similar. I guess separate queries are required here?
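One way around the rounding problem might be to export raw job counts per group instead of pre-rounded ratios, and to compute the ratio only after summing over the dimension of interest. A minimal sketch with invented rows follows. The group, machine and host names and the `fail_ratio_by` helper are hypothetical, and whether grafana can do this aggregation itself is not answered here.

```python
from collections import Counter

# Hypothetical query output: (result, job_group, machine, host, job_count).
rows = [
    ("passed", "Group A", "64bit", "worker1", 40),
    ("failed", "Group A", "64bit", "worker1", 10),
    ("passed", "Group A", "64bit", "worker2", 30),
    ("failed", "Group B", "aarch64", "worker2", 20),
]


def fail_ratio_by(rows, key_index):
    """Percentage of failed jobs per value of the chosen column."""
    total = Counter()
    failed = Counter()
    for row in rows:
        key, count = row[key_index], row[4]
        total[key] += count
        if row[0] == "failed":
            failed[key] += count
    # Divide only after summing, so small groups do not round to 0.00.
    return {key: round(100.0 * failed[key] / total[key], 2) for key in total}


print(fail_ratio_by(rows, 3))  # per host
print(fail_ratio_by(rows, 1))  # per job group
```

Counts can be summed along any dimension, while already rounded percentages cannot.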

#9 Updated by okurz about 1 month ago

For the long-term statistics, which I configured to run only every 24h, there is no data yet. Maybe the first run is only triggered after 24h and not initially on service start. Waiting …

#10 Updated by okurz about 1 month ago

Running out of patience, I manually changed the interval on osd from 24h to 2m, collected some samples, created grafana panels and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/587

#11 Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved

Graphs are live on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-15m&to=now

Some current results:

  • 11% of all tests fail vs. 40%+8%=48% passed+softfailed so at least 4x more "ok" results compared to failed
  • 25% of all tests are obsoleted. I consider this quite high; it shows that our processes focus more on failed tests than on actually finishing tests or ensuring test coverage
  • 2% of all multi-machine tests fail vs. 34%+2%=36% passed+softfailed so 18x more "ok" results compared to failed
  • 44% of all multi-machine tests are obsoleted which I consider kinda crazy
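The "ok vs. failed" factors above can be reproduced from the rounded percentages in the list. The helper name below is made up for illustration.

```python
def ok_to_failed(passed, softfailed, failed):
    """Ratio of "ok" results (passed + softfailed) to failed results."""
    return (passed + softfailed) / failed


all_tests = ok_to_failed(passed=40, softfailed=8, failed=11)     # 48 / 11
multi_machine = ok_to_failed(passed=34, softfailed=2, failed=2)  # 36 / 2

print(round(all_tests, 2), multi_machine)
```

Because the inputs are rounded to whole percent, the factors are only rough indications.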

But our results could be skewed if, for example, we store obsoleted tests longer than failed or passed ones. In general we already know that failed tests are much more often considered "important" and hence kept around longer, so in reality the ratio of passed vs. failed is likely much higher.

Especially for QA maintenance tests we cannot preserve a long history, so we cannot currently say whether the multi-machine failure ratio increased. I expect that we are better set up for the future now.

#12 Updated by okurz about 1 month ago

  • Copied to action #99135: Provide ratio of tests by result in monitoring - by worker added
