action #96191
closed
Provide "fail-rate" of tests, especially multi-machine, in grafana size:M
Description
Motivation
The hypothesis was raised that "multimachine jobs have decreased reliability since ~2 weeks (2 nodes). More nodes are even worse." Maybe true, maybe not. We should be able to calculate a fail-ratio for different categories of openQA tests, e.g. in grafana based on SQL queries. With this we would be able to support/reject the hypothesis.
Suggestion
- See what grafana data we have, or SQL queries, extend as needed
- Consider mm versus "normal" tests
- Focus on failed jobs to start with; we already deal with incompletes
- Exclude retried jobs since those don't run for mm
Updated by okurz over 3 years ago
- Related to coordination #96185: [epic] Multimachine failure rate increased added
Updated by okurz over 3 years ago
- Target version changed from future to Ready
With #96260 done we can do this now
Updated by livdywan over 3 years ago
- Subject changed from Monitor "fail-ratio" of tests, especially multi-machine tests, to have data backing (or disproofing) claims that "multi-machine tests become more unstable" to Provide "fail-rate" of tests, especially multi-machine, in grafana size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 3 years ago
- Due date set to 2021-09-29
- Status changed from Workable to In Progress
- Assignee set to okurz
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/575 to get an initial fail-ratio into influxdb, from which we can calculate a time-based fail-rate for all tests in grafana. Regarding multi-machine tests, I tried to list all of them with:
select jobs.id,state,result_dir,key,value,jobs.t_created from jobs left join job_settings on jobs.id = job_settings.job_id and key = 'PARALLEL_WITH' order by jobs.id desc limit 10;
This shows me many jobs, but they don't seem to have the job setting "PARALLEL_WITH". What am I doing wrong?
EDIT: mkittler helped. I should look into "job_dependencies". Trying something like:
openqa=> select (select count(id) from jobs where result = 'failed' and t_created >= (NOW() - interval '24 hour') and ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null)) * 100. / (select count(id) from jobs where ((select id from job_dependencies where (id = child_job_id or id = parent_job_id) and dependency = 2 limit 1) is not null));
?column?
------------------------
0.00299605519399457381
(1 row)
which should be the percentage of failed multi-machine tests among all multi-machine tests. The downside is that the query takes multiple seconds, so it is quite costly. I hope someone can help to optimize it.
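Note that in the query above only the numerator is limited to the last 24 hours while the denominator counts all multi-machine jobs, which would explain the very small value. A minimal sketch of the same idea with both counts restricted to the same time window is shown here. It uses an in-memory SQLite database as a stand-in for the openQA PostgreSQL database, so the sample rows, the function name `mm_fail_ratio` and the `since` parameter are assumptions for illustration only; `dependency = 2` is taken from the query above.

```python
import sqlite3

# In-memory stand-in for the openQA database. Only the columns used in the
# query above are modelled; the sample rows are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, t_created TEXT);
    CREATE TABLE job_dependencies (
        child_job_id INTEGER, parent_job_id INTEGER, dependency INTEGER);
    INSERT INTO jobs VALUES
        (1, 'failed', '2021-09-21 10:00:00'),
        (2, 'passed', '2021-09-21 10:00:00'),
        (3, 'passed', '2021-09-21 11:00:00'),
        (4, 'passed', '2021-09-21 11:00:00'),
        (5, 'failed', '2021-09-21 12:00:00'),
        (6, 'failed', '2021-09-01 09:00:00'),
        (7, 'failed', '2021-09-01 09:00:00');
    INSERT INTO job_dependencies VALUES
        (2, 1, 2),  -- jobs 1 and 2 form a parallel cluster
        (4, 3, 2),  -- jobs 3 and 4 form a parallel cluster
        (7, 6, 2),  -- old parallel cluster, outside a recent window
        (5, 1, 1);  -- job 5 only has a non-parallel dependency
""")


def mm_fail_ratio(conn, since):
    """Percentage of failed jobs among multi-machine jobs created since `since`.

    Returns None if no multi-machine job falls into the window.
    """
    query = """
        WITH mm_jobs AS (
            SELECT j.id, j.result
            FROM jobs j
            WHERE j.t_created >= :since
              AND EXISTS (
                  SELECT 1 FROM job_dependencies d
                  WHERE d.dependency = 2
                    AND (d.child_job_id = j.id OR d.parent_job_id = j.id)
              )
        )
        -- numerator and denominator come from the same filtered set
        SELECT 100.0 * SUM(result = 'failed') / COUNT(*) FROM mm_jobs
    """
    return conn.execute(query, {"since": since}).fetchone()[0]


print(mm_fail_ratio(conn, "2021-09-20 00:00:00"))  # 25.0
```

Whether the `EXISTS` form is actually faster on the production database is untested here; the point of the sketch is only that both counts share the same filter.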
Updated by okurz about 3 years ago
- Related to action #98604: Provide data about ratio of automatically approved SLE Maintenance incidents size:M added
Updated by okurz about 3 years ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/578 for a monitoring panel with alert for "fail-ratio" of generic tests. For multi-machine tests IMHO we need to find a better query than the above to identify the multi-machine fail-ratio. Who has an idea how we can have a faster, more optimized SQL query?
Updated by okurz about 3 years ago
What mkittler, nsinger and I found is:
openqa=> \timing
Timing is on.
openqa=> with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
result | ratio
--------------------+-------
obsoleted | 44.56
passed | 34.02
skipped | 6.54
parallel_failed | 5.65
parallel_restarted | 3.01
failed | 2.28
softfailed | 2.18
incomplete | 1.41
none | 0.16
user_cancelled | 0.1
timeout_exceeded | 0.09
user_restarted | 0.01
(12 rows)
Time: 3866.327 ms (00:03.866)
which is nice.
We can filter by a shorter time window to speed things up, e.g.
with mm_jobs as (select distinct id, result from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where t_created >= (select now() at time zone 'utc' - interval '1 hour') and dependency = 2) select result, round(count(id) * 100. / (select count(id) from mm_jobs), 2)::numeric(5,2)::float as ratio from mm_jobs group by mm_jobs.result order by ratio desc;
result | ratio
---------+-------
none | 99.14
skipped | 0.86
(2 rows)
Keep in mind that we already used select now() in other queries, which causes wrong results in PostgreSQL. So we should change those to select now() at time zone 'utc'.
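To illustrate why the time zone matters, here is a small sketch in plain Python. It assumes that t_created is stored as a UTC timestamp without time zone information and that the database session runs in a non-UTC zone; both are assumptions, as are the sample times and the helper name `utc_cutoff`.

```python
from datetime import datetime, timedelta, timezone


def utc_cutoff(now_aware, window):
    """Start of the time window as a naive UTC timestamp.

    This mirrors "now() at time zone 'utc' - interval ..." for comparison
    against timestamps that are stored as naive UTC values.
    """
    return now_aware.astimezone(timezone.utc).replace(tzinfo=None) - window


# Hypothetical server zone two hours ahead of UTC
server_zone = timezone(timedelta(hours=2))
now = datetime(2021, 9, 20, 12, 0, tzinfo=server_zone)  # 10:00 UTC

# Correct: convert to UTC first, then subtract the window
right = utc_cutoff(now, timedelta(hours=1))
# Wrong: local wall-clock time compared as if it were UTC
wrong = now.replace(tzinfo=None) - timedelta(hours=1)

# A job created at 09:30 UTC, i.e. 30 minutes before "now"
created = datetime(2021, 9, 20, 9, 30)

print(created >= right)  # True, the job is inside the one-hour window
print(created >= wrong)  # False, the job is wrongly excluded
```

With a server zone ahead of UTC the uncorrected window is too small, with a zone behind UTC it would be too large.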
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/583 with fixes and new panels and alerts.
Next steps or ideas for the future:
- Make colors a bit more consistent with the other panels
- Add additional long-term queries for long-term statistics, e.g. all jobs, not limited to a time window, and include them in high-level views, e.g. on the index page of grafana -> Understand general team performance
- Group by workers (and job groups and machines and archs?) with corresponding alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)
Updated by okurz about 3 years ago
okurz wrote:
- Add additional long-term queries for long-term statistics, e.g. all jobs, not limited to a time window, and include them in high-level views, e.g. on the index page of grafana -> Understand general team performance
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/585 to collect the data. Once accepted, it needs to be included in grafana.
- Group by workers (and job groups and machines and archs?) with corresponding alerts -> Find misbehaving workers (overall fail-rate as well as multi-machine specific fail-rate)
I wonder if it's a good idea to try a single query that includes everything, e.g.
select result, job_groups.name, machine, host, round(count(jobs.id) * 100. / (select count(jobs.id) from jobs), 2)::numeric(5,2)::float as ratio_all_long_term from jobs left join job_groups on jobs.group_id = job_groups.id left join workers on jobs.assigned_worker_id = workers.id group by result, job_groups.name, machine, host;
This takes roughly 5s on osd and returns 14k rows. But I don't know if we can sum these up afterwards in grafana; otherwise we would have 14k different values which all round to 0.00 or similar. I guess separate queries are required here?
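One possible way around the rounding problem, not something that was tried in this ticket, would be to let the grouped query return raw counts instead of rounded ratios and to compute the ratios only after summing. The sketch below shows the idea in plain Python; the rows, host names and function names are hypothetical, and whether grafana can do the equivalent aggregation is still the open question.

```python
from collections import Counter

# Hypothetical rows as the grouped query could return them if it selected
# count(jobs.id) instead of a rounded ratio: (result, host, count)
rows = [
    ("passed", "worker1", 40),
    ("failed", "worker1", 10),
    ("passed", "worker2", 45),
    ("failed", "worker2", 5),
]


def ratios_by_result(rows):
    """Overall percentage per result, summed over all hosts."""
    totals = Counter()
    for result, _host, count in rows:
        totals[result] += count
    all_jobs = sum(totals.values())
    return {result: round(100.0 * n / all_jobs, 2) for result, n in totals.items()}


def fail_ratio_by_host(rows):
    """Percentage of failed jobs per host, from the same raw counts."""
    per_host = Counter()
    failed = Counter()
    for result, host, count in rows:
        per_host[host] += count
        if result == "failed":
            failed[host] += count
    return {host: round(100.0 * failed[host] / n, 2) for host, n in per_host.items()}


print(ratios_by_result(rows))    # {'passed': 85.0, 'failed': 15.0}
print(fail_ratio_by_host(rows))  # {'worker1': 20.0, 'worker2': 10.0}
```

Counts can be added along any dimension without losing precision, while ratios that were already rounded to 0.00 cannot be recovered afterwards.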
Updated by okurz about 3 years ago
For the long-term statistics which I configured to run only every 24h there is no data yet. Maybe the first time will only be triggered after 24h, not initially on service start. Waiting …
Updated by okurz about 3 years ago
Running out of patience, I manually changed the interval on osd from 24h to 2m, collected some samples, created grafana panels and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/587
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
Graphs are live on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-15m&to=now
Some current results:
- 11% of all tests fail vs. 40%+8%=48% passed+softfailed so at least 4x more "ok" results compared to failed
- 25% of all tests are obsoleted. I consider this quite high; it shows that our processes focus more on failed tests than on actually finishing tests or ensuring test coverage
- 2% of all multi-machine tests fail vs. 34%+2%=36% passed+softfailed so 18x more "ok" results compared to failed
- 44% of all multi-machine tests are obsoleted which I consider kinda crazy
But our results could be skewed if, for example, we store obsoleted tests longer than failed or passed ones. In general we already know that failed tests are much more often considered "important" and hence kept around longer, so in reality the ratio of passed vs. failed is likely much higher.
Especially for QA maintenance tests we cannot preserve a long history, so we cannot currently say whether the multi-machine failure ratio increased, but I expect that we are better set up for the future now.
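For reference, the "4x" and "18x" figures in the results above follow directly from the quoted percentages. A tiny sketch of the arithmetic, with a hypothetical helper name:

```python
def ok_to_failed(passed, softfailed, failed):
    """How many "ok" results (passed + softfailed) there are per failed result."""
    return (passed + softfailed) / failed


all_tests = ok_to_failed(passed=40, softfailed=8, failed=11)
mm_tests = ok_to_failed(passed=34, softfailed=2, failed=2)

print(round(all_tests, 2))  # 4.36
print(mm_tests)             # 18.0
```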
Updated by okurz about 3 years ago
- Copied to action #99135: Provide ratio of tests by result in monitoring - by worker added
Updated by livdywan about 3 years ago
- Copied to action #102428: Provide "fail-rate" alerting with ratio_mm_failed 5.360 size:M added