Project

General

Profile

Actions

action #99135

closed

coordination #96185: [epic] Multimachine failure rate increased

Provide ratio of tests by result in monitoring - by worker

Added by okurz about 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See parent #96185

Suggestion

  • Understand what has been done in #96191
  • Extend telegraf+grafana to show tests by result by worker

Related issues 2 (1 open1 closed)

Copied from openQA Project - action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:MResolvedokurz2021-07-282021-09-29

Actions
Copied to openQA Infrastructure - action #99138: Provide ratio of tests by result in monitoring - by job groupNew

Actions
Actions #1

Updated by okurz about 3 years ago

  • Copied from action #96191: Provide "fail-rate" of tests, especially multi-machine, in grafana size:M added
Actions #2

Updated by okurz about 3 years ago

  • Copied to action #99138: Provide ratio of tests by result in monitoring - by job group added
Actions #3

Updated by okurz about 3 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(*) from finished_jobs where result = 'failed'), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host 
---------------------+----------------------
 openqaworker5       |                73.83
 openqaworker6       |                65.32
 openqaworker8       |                 50.8
 openqaworker9       |                47.23
 openqaworker3       |                44.86
 grenache-1          |                43.52
 openqaworker13      |                38.54
 openqaworker-arm-2  |                38.01
 openqaworker2       |                28.82
 openqaworker10      |                25.61
 openqaworker-arm-1  |                22.74
 openqaworker-arm-3  |                20.78
 QA-Power8-5-kvm     |                 19.9
 QA-Power8-4-kvm     |                19.53
 powerqaworker-qam-1 |                17.16
 malbec              |                10.62
 automotive-3        |                 0.83
 openqaworker-arm-5  |                  0.2
 openqaworker-arm-4  |                 0.16
 openqaworker11      |                 0.03
(20 rows)

Time: 28402.305 ms (00:28.402)
Actions #4

Updated by openqa_review about 3 years ago

  • Due date set to 2021-10-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz about 3 years ago

The above does not look right. I don't think count(*) is doing what we want.

Checking numbers manually:

openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5';
 count 
-------
 69322
(1 row)

openqa=> select count(jobs.id) from jobs left join workers on jobs.assigned_worker_id = workers.id where host = 'openqaworker5' and result = 'failed';
 count 
-------
  8873
(1 row)

openqa=> select 8873 * 100./ 69322
      ?column?       
---------------------
 12.7996884106055798
(1 row)

so this is better:

openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none' and host ~ 'openqaworker5') select count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j) from finished_jobs where result='failed';                                              
      ?column?       
---------------------
 12.7991345113595384
(1 row)

and now try to group by host again

openqa=> with finished_jobs as (select distinct jobs.*, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(finished_jobs.id) * 100. / (select count(j.id) from finished_jobs as j), 2)::numeric(5,2)::float as ratio_failed_by_host from finished_jobs where host is not null group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host 
---------------------+----------------------
 openqaworker5       |                 8.09
 openqaworker6       |                 7.13
 openqaworker8       |                 5.55
 openqaworker9       |                 5.16
 grenache-1          |                 4.93
 openqaworker3       |                  4.9
 openqaworker13      |                 4.22
 openqaworker-arm-2  |                 4.13
 openqaworker2       |                 3.26
 openqaworker10      |                 2.83
 openqaworker-arm-1  |                 2.46
 openqaworker-arm-3  |                 2.25
 QA-Power8-5-kvm     |                  2.2
 QA-Power8-4-kvm     |                 2.17
 powerqaworker-qam-1 |                  1.9
 malbec              |                 1.17
 automotive-3        |                 0.04
 openqaworker-arm-5  |                 0.04
 openqaworker-arm-4  |                 0.04
 openqaworker11      |                    0
(20 rows)

hm, numbers for openqaworker5 don't match. That's calculating "failed on this host divided by number of all jobs", right? But I can't put another group by host into the subquery in the division. Not sure what to do here.

Actions #6

Updated by okurz about 3 years ago

mkittler helped me. The relevant keyword to look for is "aggregate functions", https://www.postgresql.org/docs/current/functions-aggregate.html . Looks good now, except for naming.

openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(id), host from finished group by host;
        ?column?        |        host         
------------------------+---------------------
     4.2294612883819949 | 
    14.7977107887579329 | openqaworker-arm-2
    13.4478796741174013 | openqaworker-arm-3
    11.1542000788747207 | openqaworker6
    28.5359801488833747 | openqaworker-arm-5
    10.8831485959158268 | openqaworker3
 0.00000000000000000000 | openqaworker11
    32.5355594568449715 | openqaworker2
    14.4309348216065378 | malbec
    11.5793056501021103 | openqaworker9
    13.9535532075050880 | openqaworker13
    12.7698128524419060 | openqaworker5
    16.2826128356503523 | openqaworker-arm-1
    11.7461503341219441 | openqaworker10
    11.5025709388687869 | openqaworker8
    17.2077922077922078 | automotive-3
    14.2372463142310777 | powerqaworker-qam-1
    17.0978870202673566 | QA-Power8-4-kvm
    29.2534701695476365 | grenache-1
    30.1675977653631285 | openqaworker-arm-4
    13.8192295379361861 | QA-Power8-5-kvm
(21 rows)

maybe quicker+easier to use count(*) in division:

openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select sum(case when result='failed' then 1 else 0 end) * 100. / count(*), host from finished group by host;                                                                                
        ?column?        |        host         
------------------------+---------------------
     4.2296854153938230 | 
    14.8001246564863869 | openqaworker-arm-2
    13.4471773982975612 | openqaworker-arm-3
    11.1543833703064662 | openqaworker6
    28.5359801488833747 | openqaworker-arm-5
    10.8825782970979791 | openqaworker3
 0.00000000000000000000 | openqaworker11
    32.5380619738491850 | openqaworker2
    14.4309348216065378 | malbec
    11.5793056501021103 | openqaworker9
    13.9563411302238702 | openqaworker13
    12.7680241239833568 | openqaworker5
    16.2826128356503523 | openqaworker-arm-1
    11.7456628206192413 | openqaworker10
    11.4983389407308661 | openqaworker8
    17.2077922077922078 | automotive-3
    14.2372463142310777 | powerqaworker-qam-1
    17.1068233265064137 | QA-Power8-4-kvm
    29.2584886030454932 | grenache-1
    30.1675977653631285 | openqaworker-arm-4
    13.8238275622899216 | QA-Power8-5-kvm
(21 rows)

Time: 3080.002 ms (00:03.080)

the "ternary operator" is nice but there seems to be another option which we consider even nicer which is the "filter" option for any aggregate expression, see https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES:

openqa=> with finished as (select jobs.id, result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select count(*) filter (where result='failed') * 100. / count(*), host from finished group by host;                                                                                                                   
        ?column?        |        host         
------------------------+---------------------
     4.2286704416902216 | 
    14.8001246564863869 | openqaworker-arm-2
    13.4457730666805911 | openqaworker-arm-3
    11.1576905372525810 | openqaworker6
    28.4653465346534653 | openqaworker-arm-5
    10.8867490658235125 | openqaworker3
 0.00000000000000000000 | openqaworker11
    32.5500591037718953 | openqaworker2
    14.4479872459147071 | malbec
    11.5840370471261237 | openqaworker9
    13.9578976718248989 | openqaworker13
    12.7751515327281691 | openqaworker5
    16.2826128356503523 | openqaworker-arm-1
    11.7441860465116279 | openqaworker10
    11.5027200948328782 | openqaworker8
    17.2077922077922078 | automotive-3
    14.2425363927954602 | powerqaworker-qam-1
    17.1122129998922065 | QA-Power8-4-kvm
    29.2573941414284701 | grenache-1
    30.1675977653631285 | openqaworker-arm-4
    13.8238275622899216 | QA-Power8-5-kvm
(21 rows)

Time: 3096.322 ms (00:03.096)

so the final query I propose even skipping the id as we just count rows is

openqa=> with finished as (select result, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host is not null group by host order by ratio_failed_by_host desc;           
        host         | ratio_failed_by_host 
---------------------+----------------------
 openqaworker2       |                32.55
 openqaworker-arm-4  |                30.36
 grenache-1          |                29.26
 openqaworker-arm-5  |                28.47
 automotive-3        |                17.21
 QA-Power8-4-kvm     |                17.12
 openqaworker-arm-1  |                16.29
 openqaworker-arm-2  |                 14.8
 malbec              |                14.45
 powerqaworker-qam-1 |                14.25
 openqaworker13      |                13.96
 QA-Power8-5-kvm     |                13.82
 openqaworker-arm-3  |                13.45
 openqaworker5       |                12.78
 openqaworker10      |                11.74
 openqaworker9       |                11.59
 openqaworker8       |                 11.5
 openqaworker6       |                11.16
 openqaworker3       |                10.89
 openqaworker11      |                    0
(20 rows)

Time: 2733.732 ms (00:02.734)

created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/593

Actions #7

Updated by okurz about 3 years ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by okurz about 3 years ago

Actions #9

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved
Actions #10

Updated by okurz almost 3 years ago

  • Due date deleted (2021-10-09)
Actions

Also available in: Atom PDF