action #120540
Status: closed
[timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S
Description
Motivation
#73072 and #120154. Most of our tests are based on the qemu backend, which allows efficient use of hardware. In other cases, like the ppc hmc backend, we treat tests like "bare metal" tests, which is rather inefficient: e.g. it does not allow loading VM images, and AFAIK RAM cannot be configured from tests, which strongly limits the number of instances per hardware machine. To find our least efficient implementations we should query the database of tests to compute a "tests per hardware machine ratio".
Suggestions
- Run SQL queries against OSD's production database to get the data, either manually, e.g.
  ssh osd 'sudo -u geekotest psql openqa'
  and in there copy-paste one of the existing SQL commands from the telegraf config for a start, or use our metabase instance https://maintenance-statistics.dyn.cloud.suse.de (okta credentials, e.g. username "okurz", not email)
- Compare to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L108 . Maybe easiest to just come up with the according query and replace "by host" with "by backend"
- Run a calculation like
  number of "failed" / number of all jobs
  grouped by "machine" or "backend" or "WORKER_CLASS", whatever is easiest (for the WORKER_CLASS variant see the sketch after this list)
- Make a list of "inefficient" vs. "efficient" backends (or is everything but QEMU "inefficient"?). That's the hypothesis to prove
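For the WORKER_CLASS grouping a join against the per-job settings would be needed. A rough, untested sketch, assuming the job_settings key/value table from the openQA schema:

-- Hypothetical: failure ratio grouped by WORKER_CLASS, looked up from the
-- job_settings table (one key/value row per job setting)
SELECT job_settings.value AS worker_class,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2) AS ratio_failed,
       COUNT(*) AS job_count
FROM jobs
JOIN job_settings ON job_settings.job_id = jobs.id AND job_settings.key = 'WORKER_CLASS'
GROUP BY job_settings.value
ORDER BY ratio_failed DESC;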
Updated by okurz almost 2 years ago
- Subject changed from [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" to [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by robert.richardson almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by robert.richardson almost 2 years ago
- Status changed from In Progress to Feedback
Following suggestion #2 I was able to calculate the failure rate per machine.
query:
-- failure ratio per machine over all jobs in the database
SELECT machine,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2)::numeric(5,2)::float
       AS ratio_failed_by_machine
FROM jobs
WHERE machine IS NOT NULL
GROUP BY machine
ORDER BY ratio_failed_by_machine DESC;
result:
machine | ratio_failed_by_machine
---------------------------------+-------------------------
ipmi-merckx | 100
Standard_D8s_v3 | 100
i3en.large | 100
64bit-ipmi-amd | 100
Standard_L8s_v2 | 100
chan-1 | 100
ix64ph1014 | 100
Standard_E2as_v3 | 100
cfconrad_machine | 100
aarch64_maintenance | 92.86
ec2_t3.2xlarge | 92
az_Standard_E8s_v3 | 90.77
ec2_r6i.xlarge | 88.24
qam-caasp_x86_64 | 74.39
s390x-kvm | 73.91
ec2_m4.2xlarge | 72.5
gce_n1_highmem_8 | 71.58
svirt-vmware70 | 69.29
64bit-ipmi-amd-zen3 | 63.48
windows_uefi_boot | 60.98
64bit-ipmi-large-mem-intel | 60.13
svirt-hyperv2012r2-uefi | 57.76
UEFI | 57.14
svirt-vmware65 | 55.79
laptop_64bit-virtio-vga | 52.54
svirt-hyperv-uefi | 52.14
ppc64le-smp | 51.85
64bit-ipmi-large-mem | 50.72
ppc64-smp | 50
ipmi-64bit-unarmed | 50
uefi-2G | 50
64bit-ipmi-nvdimm | 49.45
svirt-hyperv | 48.45
64bit-ipmi | 48.44
win11_uefi | 47.93
ipmi-64bit-mlx_con5 | 46.43
az_Standard_H8 | 45.83
zkvm-image | 45.55
zkvm | 44.55
s390x-zfcp | 44.31
windows_bios_boot | 43.75
svirt-hyperv2012r2 | 43.28
svirt-kgraft | 42.86
ec2_m5d.large | 40.28
az_Standard_DC2s | 40
svirt-xen-pv | 39.84
ppc64le-hmc-sap | 39.58
uefi-sap | 38.6
svirt-xen-hvm | 38.15
virt-arm-64bit-ipmi-machine | 37.88
ec2_m5.metal | 37.5
svirt-hyperv2016-uefi | 37.33
64bit_cirrus | 37.27
aarch64_raid | 36.73
ppc64le-hmc | 36.47
virt-s390x-kvm-sle12sp5 | 36.26
64bit-virtio-vga | 34.03
ipmi-coppi | 33.73
svirt-hyperv2016 | 33.66
s390x-kvm-sle15 | 33.58
64bit-ipmi-sriov | 33.33
uefi-virtio-vga | 31.89
ppc64le-hmc-single-disk | 31.85
s390x-zVM-ctc | 30
ipmi-sonic | 30
s390x-zVM-Upgrade-sp2 | 29.3
64bit-smp | 28.93
s390x-zVM-Upgrade-m1 | 28.67
az_Standart_L8s_v2 | 28.57
virt-s390x-kvm-sle15sp5 | 28.3
64bit-2gbram-cirrus | 27.75
ec2_r4.8xlarge | 27.27
s390x-kvm-sle12-mm | 27.11
s390x-zVM-vswitch-l2 | 26.98
RPi3B+ | 26.67
ppc64le-hmc-4disk | 26.32
64bit-staging | 26.19
virt-s390x-kvm-sle15sp4 | 25.51
uefi-staging | 25
svirt-vmware | 25
virt-mm-64bit-ipmi | 24.53
64bit-amd | 24.33
virt-pvusb-64bit-ipmi | 24.24
ppc64le-spvm | 23.64
RPi3B | 23.33
s390x-zVM-vswitch-l3 | 22.69
ppc64le | 22.62
ppc64le-2g | 22.12
s390x-zVM-hsi-l3 | 21.97
RPi4 | 21.05
ec2_t2.large | 20.38
s390x-zVM-hsi-l2 | 20.25
aarch64 | 19.77
s390x-kvm-sle12 | 19.31
ipmi-kernel-rt | 19.12
gce_n2d_standard_2_confidential | 18.85
ppc64le-sap-qam | 18.83
ppc64le-sap | 18.48
gce_n1_standard_2 | 18.16
uefi | 16.92
s390x-zVM-Upgrade-sp1 | 16.91
svirt-kvm-uefi | 16.67
svirt-kvm | 16.67
ec2_a1.large | 16.39
ppc64le-no-tmpfs | 15.28
caasp_x86_64 | 15.05
64bit-no-tmpfs | 14.91
ipmi-coppi-xen | 14.29
aarch64-virtio | 14.26
az_Standard_A2_v2 | 14.08
ipmi-tails | 13.16
az_Standard_B2s | 13.01
ec2_i3.metal | 12.5
64bit-sap | 12.38
64bit | 11.96
win10_uefi | 11.94
win10_64bit | 11.44
virt-s390x-kvm-sle15sp3 | 10.26
bmw-mpad3 | 10.18
ec2_a1.medium | 10.06
64bit-2gbram | 10.04
ipmi-tyrion | 9.3
ec2_m4.large | 8.33
64bit-sap-qam | 8.19
64bit-qxl | 8.18
ec2_i3.8xlarge | 8.01
az_Standard_L8s_v2 | 7.44
ec2_t2.small | 6.91
aarch64-virtio-4gbram | 6.87
virt-s390x-kvm-sle15sp2 | 6.67
ec2_c5.large | 6.38
64bit-4gbram | 6.35
64bit_win | 6.12
ec2_m5.large | 5.14
az_Standard_E2s_v4 | 5.05
ppc64le-virtio | 4.98
gce_n1_standard_1 | 4.61
az_Standard_D2s_v4 | 4.55
az_Standard_B1s | 4.39
ec2_i3.large | 4.38
ec2_t3.small | 4.36
gce_n1_highmem_2 | 4.23
az_Standard_F2s_v2 | 4.23
gce_n2d_standard_2 | 3.97
gce_n1_highcpu_2 | 3.19
az_Standard_DC2s_v2 | 3.02
gce_f1_micro | 2.49
ec2_t3.medium | 0
ec2_r3.8xlarge | 0
ec2_c4.large | 0
s390x-zVM-Upgrade | 0
virt-s390x-kvm-sle15 | 0
ppc64 | 0
virt-s390x-kvm-sle15sp1 | 0
ipmi-64bit-thunderx | 0
virt-s390x-kvm-sle12sp4 | 0
(156 rows)
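Note that these are all-time numbers; if historic jobs skew the picture, the query could be restricted to recent jobs. A sketch, assuming the jobs table's t_finished timestamp column:

-- Hypothetical variant: same ratio, but only jobs finished in the last 90 days
SELECT machine,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2) AS ratio_failed_by_machine
FROM jobs
WHERE machine IS NOT NULL
  AND t_finished > now() - interval '90 days'
GROUP BY machine
ORDER BY ratio_failed_by_machine DESC;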
Updated by okurz almost 2 years ago
Nice. I suggest as the next step to create a ratio "per backend". Likely a little more complicated as you would need to join the according machine settings, but IMHO still manageable. It would also be helpful to know the total number of jobs per group, to put the ratios in relation.
Updated by robert.richardson almost 2 years ago
OK, I think I have it.
query:
-- failure ratio per backend: join the machine definitions to map each job's
-- machine to its backend, counting only finished jobs (result != 'none')
WITH finished AS (
    SELECT result, backend
    FROM jobs
    LEFT JOIN machines ON jobs.machine = machines.name
    WHERE result != 'none'
)
SELECT backend,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2)::numeric(5,2)::float AS ratio_failed_by_backend,
       COUNT(*) AS job_count
FROM finished
WHERE backend IS NOT NULL
GROUP BY backend
ORDER BY ratio_failed_by_backend DESC;
result:
backend | ratio_failed_by_backend | job_count
-----------+-------------------------+------------------
ipmi | 45.48 | 8305
pvm_hmc | 32.72 | 9970
s390x | 26.28 | 3961
svirt | 25.85 | 70921
spvm | 23.64 | 2978
qemu | 13.79 | 529073
generalhw | 11.26 | 1297
This confirms that all backends except generalhw are inefficient compared to qemu. Should I mark this ticket as resolved?
Updated by okurz almost 2 years ago
Thank you. I guess that's all.
I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1671724262632709
@here Robert Richardson has collected nice statistics from openqa.suse.de to answer the question of stability of our various backends:
backend | ratio_failed_by_backend | job_count
-----------+-------------------------+------------------
ipmi | 45.48 | 8305
pvm_hmc | 32.72 | 9970
s390x | 26.28 | 3961
svirt | 25.85 | 70921
spvm | 23.64 | 2978
qemu | 13.79 | 529073
generalhw | 11.26 | 1297
So the result in general supports my expectations from before, which are:
- Roughly 90% of tests are running on qemu, so this continues to be by far our most important backend
- non-qemu backends, in particular ipmi and pvm_hmc, are 3-4x more prone to fail than qemu jobs
What does that mean for you?
- If you can, run tests on qemu because this is the most stable and scalable platform (regardless of architecture) and this will stay the backend for which we will provide the best support
- If you still think you need ipmi or pvm_hmc or the like, then you (as in "testing related squads") should definitely dedicate resources to improving and extending those backends; everything else will be a horrible experience. One specific example: ppc64le on qemu is unfortunately not well supported upstream, or not supported anymore at all, so any migration of tests to pvm_hmc should only come with according improvements in the backend, and I don't see many reasonable contributions planned by teams in this direction
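(For reference, the qemu share can be derived from the job_count column above: 529073 of 626505 finished jobs with a known backend, i.e. about 84%. A sketch computing the share directly, reusing the machines join from the query above:)

-- share of all finished jobs per backend, via a window function over the
-- grouped counts
SELECT backend,
       COUNT(*) AS job_count,
       ROUND(COUNT(*) * 100. / SUM(COUNT(*)) OVER (), 2) AS share_of_all_jobs
FROM jobs
LEFT JOIN machines ON jobs.machine = machines.name
WHERE backend IS NOT NULL AND result != 'none'
GROUP BY backend
ORDER BY job_count DESC;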
Updated by robert.richardson almost 2 years ago
- Status changed from Feedback to Resolved