action #99135
coordination #96185: [epic] Multimachine failure rate increased
Provide ratio of tests by result in monitoring - by worker
0%
Description
Related issues
History
#1
Updated by okurz over 1 year ago
- Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
#2
Updated by okurz over 1 year ago
- Copied to action #99138: Provide ratio of tests by result in monitoring - by job group added
#3
Updated by okurz over 1 year ago
- Status changed from New to In Progress
- Assignee set to okurz
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(*) from finished_jobs where result = 'failed'), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc; host | ratio_failed_by_host ---------------------+---------------------- openqaworker5 | 73.83 openqaworker6 | 65.32 openqaworker8 | 50.8 openqaworker9 | 47.23 openqaworker3 | 44.86 grenache-1 | 43.52 openqaworker13 | 38.54 openqaworker-arm-2 | 38.01 openqaworker2 | 28.82 openqaworker10 | 25.61 openqaworker-arm-1 | 22.74 openqaworker-arm-3 | 20.78 QA-Power8-5-kvm | 19.9 QA-Power8-4-kvm | 19.53 powerqaworker-qam-1 | 17.16 malbec | 10.62 automotive-3 | 0.83 openqaworker-arm-5 | 0.2 openqaworker-arm-4 | 0.16 openqaworker11 | 0.03 (20 rows) Time: 28402.305 ms (00:28.402)
#4
Updated by openqa_review over 1 year ago
- Due date set to 2021-10-09
Setting due date based on mean cycle time of SUSE QE Tools
#5
Updated by okurz over 1 year ago
The above does not look right. I don't think count(*)
is doing what we want.
Checking numbers manually:
openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5'; count ------- 69322 (1 row) openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5' and result = 'failed'; count ------- 8873 (1 row) openqa=> select 8873 * 100./ 69322 ?column? --------------------- 12.7996884106055798 (1 row)
so this is better:
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none' and host ~ 'openqaworker5') select count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j) from finished_jobs where result='failed'; ?column? --------------------- 12.7991345113595384 (1 row)
and now try to group by host again
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc; host | ratio_failed_by_host ---------------------+---------------------- openqaworker5 | 8.09 openqaworker6 | 7.13 openqaworker8 | 5.55 openqaworker9 | 5.16 grenache-1 | 4.93 openqaworker3 | 4.9 openqaworker13 | 4.22 openqaworker-arm-2 | 4.13 openqaworker2 | 3.26 openqaworker10 | 2.83 openqaworker-arm-1 | 2.46 openqaworker-arm-3 | 2.25 QA-Power8-5-kvm | 2.2 QA-Power8-4-kvm | 2.17 powerqaworker-qam-1 | 1.9 malbec | 1.17 automotive-3 | 0.04 openqaworker-arm-5 | 0.04 openqaworker-arm-4 | 0.04 openqaworker11 | 0 (20 rows)
hm, numbers for openqaworker5 don't match. That's calculating "failed on this host divided by number of all jobs", right? But I can't put another group by host
into the subquery in the division. Not sure what to do here.
#6
Updated by okurz over 1 year ago
mkittler helped me. The relevant keyword to look for is "aggregate functions", https://www.postgresql.org/docs/current/functions-aggregate.html . Looks good now, except for naming.
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(id), host from finished group by host; ?column? | host ------------------------+--------------------- 4.2294612883819949 | 14.7977107887579329 | openqaworker-arm-2 13.4478796741174013 | openqaworker-arm-3 11.1542000788747207 | openqaworker6 28.5359801488833747 | openqaworker-arm-5 10.8831485959158268 | openqaworker3 0.00000000000000000000 | openqaworker11 32.5355594568449715 | openqaworker2 14.4309348216065378 | malbec 11.5793056501021103 | openqaworker9 13.9535532075050880 | openqaworker13 12.7698128524419060 | openqaworker5 16.2826128356503523 | openqaworker-arm-1 11.7461503341219441 | openqaworker10 11.5025709388687869 | openqaworker8 17.2077922077922078 | automotive-3 14.2372463142310777 | powerqaworker-qam-1 17.0978870202673566 | QA-Power8-4-kvm 29.2534701695476365 | grenache-1 30.1675977653631285 | openqaworker-arm-4 13.8192295379361861 | QA-Power8-5-kvm (21 rows)
maybe quicker+easier to use count(*)
in division:
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(*), host from finished group by host; ?column? | host ------------------------+--------------------- 4.2296854153938230 | 14.8001246564863869 | openqaworker-arm-2 13.4471773982975612 | openqaworker-arm-3 11.1543833703064662 | openqaworker6 28.5359801488833747 | openqaworker-arm-5 10.8825782970979791 | openqaworker3 0.00000000000000000000 | openqaworker11 32.5380619738491850 | openqaworker2 14.4309348216065378 | malbec 11.5793056501021103 | openqaworker9 13.9563411302238702 | openqaworker13 12.7680241239833568 | openqaworker5 16.2826128356503523 | openqaworker-arm-1 11.7456628206192413 | openqaworker10 11.4983389407308661 | openqaworker8 17.2077922077922078 | automotive-3 14.2372463142310777 | powerqaworker-qam-1 17.1068233265064137 | QA-Power8-4-kvm 29.2584886030454932 | grenache-1 30.1675977653631285 | openqaworker-arm-4 13.8238275622899216 | QA-Power8-5-kvm (21 rows) Time: 3080.002 ms (00:03.080)
the "ternary operator" is nice but there seems to be another option which we consider even nicer which is the "filter" option for any aggregate expression, see https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES:
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select count(*) filter (where result='failed') * 100. / count(*), host from finished group by host; ?column? | host ------------------------+--------------------- 4.2286704416902216 | 14.8001246564863869 | openqaworker-arm-2 13.4457730666805911 | openqaworker-arm-3 11.1576905372525810 | openqaworker6 28.4653465346534653 | openqaworker-arm-5 10.8867490658235125 | openqaworker3 0.00000000000000000000 | openqaworker11 32.5500591037718953 | openqaworker2 14.4479872459147071 | malbec 11.5840370471261237 | openqaworker9 13.9578976718248989 | openqaworker13 12.7751515327281691 | openqaworker5 16.2826128356503523 | openqaworker-arm-1 11.7441860465116279 | openqaworker10 11.5027200948328782 | openqaworker8 17.2077922077922078 | automotive-3 14.2425363927954602 | powerqaworker-qam-1 17.1122129998922065 | QA-Power8-4-kvm 29.2573941414284701 | grenache-1 30.1675977653631285 | openqaworker-arm-4 13.8238275622899216 | QA-Power8-5-kvm (21 rows) Time: 3096.322 ms (00:03.096)
so the final query I propose even skipping the id as we just count rows is
openqa=> with finished as (select result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host is not null group by host order by ratio_failed_by_host desc; host | ratio_failed_by_host ---------------------+---------------------- openqaworker2 | 32.55 openqaworker-arm-4 | 30.36 grenache-1 | 29.26 openqaworker-arm-5 | 28.47 automotive-3 | 17.21 QA-Power8-4-kvm | 17.12 openqaworker-arm-1 | 16.29 openqaworker-arm-2 | 14.8 malbec | 14.45 powerqaworker-qam-1 | 14.25 openqaworker13 | 13.96 QA-Power8-5-kvm | 13.82 openqaworker-arm-3 | 13.45 openqaworker5 | 12.78 openqaworker10 | 11.74 openqaworker9 | 11.59 openqaworker8 | 11.5 openqaworker6 | 11.16 openqaworker3 | 10.89 openqaworker11 | 0 (20 rows) Time: 2733.732 ms (00:02.734)
created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/593
#7
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
#8
Updated by okurz over 1 year ago
telegraf changes are in, now grafana: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/594
#9
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
#10
Updated by okurz over 1 year ago
- Due date deleted (
2021-10-09)