action #99135
closedcoordination #96185: [epic] Multimachine failure rate increased
Provide ratio of tests by result in monitoring - by worker
Description
Updated by okurz about 3 years ago
- Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Updated by okurz about 3 years ago
- Copied to action #99138: Provide ratio of tests by result in monitoring - by job group added
Updated by okurz about 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(*) from finished_jobs where result = 'failed'), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host
---------------------+----------------------
openqaworker5 | 73.83
openqaworker6 | 65.32
openqaworker8 | 50.8
openqaworker9 | 47.23
openqaworker3 | 44.86
grenache-1 | 43.52
openqaworker13 | 38.54
openqaworker-arm-2 | 38.01
openqaworker2 | 28.82
openqaworker10 | 25.61
openqaworker-arm-1 | 22.74
openqaworker-arm-3 | 20.78
QA-Power8-5-kvm | 19.9
QA-Power8-4-kvm | 19.53
powerqaworker-qam-1 | 17.16
malbec | 10.62
automotive-3 | 0.83
openqaworker-arm-5 | 0.2
openqaworker-arm-4 | 0.16
openqaworker11 | 0.03
(20 rows)
Time: 28402.305 ms (00:28.402)
Updated by openqa_review about 3 years ago
- Due date set to 2021-10-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 3 years ago
The above does not look right. I don't think count(*)
is doing what we want.
Checking numbers manually:
openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5';
count
-------
69322
(1 row)
openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5' and result = 'failed';
count
-------
8873
(1 row)
openqa=> select 8873 * 100./ 69322
?column?
---------------------
12.7996884106055798
(1 row)
so this is better:
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none' and host ~ 'openqaworker5') select count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j) from finished_jobs where result='failed';
?column?
---------------------
12.7991345113595384
(1 row)
and now try to group by host again
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host
---------------------+----------------------
openqaworker5 | 8.09
openqaworker6 | 7.13
openqaworker8 | 5.55
openqaworker9 | 5.16
grenache-1 | 4.93
openqaworker3 | 4.9
openqaworker13 | 4.22
openqaworker-arm-2 | 4.13
openqaworker2 | 3.26
openqaworker10 | 2.83
openqaworker-arm-1 | 2.46
openqaworker-arm-3 | 2.25
QA-Power8-5-kvm | 2.2
QA-Power8-4-kvm | 2.17
powerqaworker-qam-1 | 1.9
malbec | 1.17
automotive-3 | 0.04
openqaworker-arm-5 | 0.04
openqaworker-arm-4 | 0.04
openqaworker11 | 0
(20 rows)
hm, numbers for openqaworker5 don't match. That's calculating "failed on this host divided by number of all jobs", right? But I can't put another group by host
into the subquery in the division. Not sure what to do here.
Updated by okurz about 3 years ago
mkittler helped me. The relevant keyword to look for is "aggregate functions", https://www.postgresql.org/docs/current/functions-aggregate.html . Looks good now, except for naming.
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(id), host from finished group by host;
?column? | host
------------------------+---------------------
4.2294612883819949 |
14.7977107887579329 | openqaworker-arm-2
13.4478796741174013 | openqaworker-arm-3
11.1542000788747207 | openqaworker6
28.5359801488833747 | openqaworker-arm-5
10.8831485959158268 | openqaworker3
0.00000000000000000000 | openqaworker11
32.5355594568449715 | openqaworker2
14.4309348216065378 | malbec
11.5793056501021103 | openqaworker9
13.9535532075050880 | openqaworker13
12.7698128524419060 | openqaworker5
16.2826128356503523 | openqaworker-arm-1
11.7461503341219441 | openqaworker10
11.5025709388687869 | openqaworker8
17.2077922077922078 | automotive-3
14.2372463142310777 | powerqaworker-qam-1
17.0978870202673566 | QA-Power8-4-kvm
29.2534701695476365 | grenache-1
30.1675977653631285 | openqaworker-arm-4
13.8192295379361861 | QA-Power8-5-kvm
(21 rows)
maybe quicker+easier to use count(*)
in division:
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(*), host from finished group by host;
?column? | host
------------------------+---------------------
4.2296854153938230 |
14.8001246564863869 | openqaworker-arm-2
13.4471773982975612 | openqaworker-arm-3
11.1543833703064662 | openqaworker6
28.5359801488833747 | openqaworker-arm-5
10.8825782970979791 | openqaworker3
0.00000000000000000000 | openqaworker11
32.5380619738491850 | openqaworker2
14.4309348216065378 | malbec
11.5793056501021103 | openqaworker9
13.9563411302238702 | openqaworker13
12.7680241239833568 | openqaworker5
16.2826128356503523 | openqaworker-arm-1
11.7456628206192413 | openqaworker10
11.4983389407308661 | openqaworker8
17.2077922077922078 | automotive-3
14.2372463142310777 | powerqaworker-qam-1
17.1068233265064137 | QA-Power8-4-kvm
29.2584886030454932 | grenache-1
30.1675977653631285 | openqaworker-arm-4
13.8238275622899216 | QA-Power8-5-kvm
(21 rows)
Time: 3080.002 ms (00:03.080)
the "ternary operator" is nice but there seems to be another option which we consider even nicer which is the "filter" option for any aggregate expression, see https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES:
openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select count(*) filter (where result='failed') * 100. / count(*), host from finished group by host;
?column? | host
------------------------+---------------------
4.2286704416902216 |
14.8001246564863869 | openqaworker-arm-2
13.4457730666805911 | openqaworker-arm-3
11.1576905372525810 | openqaworker6
28.4653465346534653 | openqaworker-arm-5
10.8867490658235125 | openqaworker3
0.00000000000000000000 | openqaworker11
32.5500591037718953 | openqaworker2
14.4479872459147071 | malbec
11.5840370471261237 | openqaworker9
13.9578976718248989 | openqaworker13
12.7751515327281691 | openqaworker5
16.2826128356503523 | openqaworker-arm-1
11.7441860465116279 | openqaworker10
11.5027200948328782 | openqaworker8
17.2077922077922078 | automotive-3
14.2425363927954602 | powerqaworker-qam-1
17.1122129998922065 | QA-Power8-4-kvm
29.2573941414284701 | grenache-1
30.1675977653631285 | openqaworker-arm-4
13.8238275622899216 | QA-Power8-5-kvm
(21 rows)
Time: 3096.322 ms (00:03.096)
so the final query I propose even skipping the id as we just count rows is
openqa=> with finished as (select result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host is not null group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host
---------------------+----------------------
openqaworker2 | 32.55
openqaworker-arm-4 | 30.36
grenache-1 | 29.26
openqaworker-arm-5 | 28.47
automotive-3 | 17.21
QA-Power8-4-kvm | 17.12
openqaworker-arm-1 | 16.29
openqaworker-arm-2 | 14.8
malbec | 14.45
powerqaworker-qam-1 | 14.25
openqaworker13 | 13.96
QA-Power8-5-kvm | 13.82
openqaworker-arm-3 | 13.45
openqaworker5 | 12.78
openqaworker10 | 11.74
openqaworker9 | 11.59
openqaworker8 | 11.5
openqaworker6 | 11.16
openqaworker3 | 10.89
openqaworker11 | 0
(20 rows)
Time: 2733.732 ms (00:02.734)
created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/593
Updated by okurz about 3 years ago
telegraf changes are in, now grafana: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/594
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved