action #120540

closed

[timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S

Added by okurz about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Start date: 2022-11-15
Due date:
% Done: 0%
Estimated time:

Description

Motivation

#73072 and #120154. Most of our tests are based on the qemu backend, which allows efficient use of the hardware. In other cases, like the ppc hmc backend, we treat tests like "bare metal" tests, which is rather inefficient: it does not allow loading VM images, and AFAIK RAM can not be configured from tests, which greatly limits the number of test instances per hardware machine. To find our least efficient implementations we should query the test database for a "tests per hardware machine ratio".

Suggestions

  • Run SQL queries against OSD's production database to get the data, either manually, e.g. ssh osd 'sudo -u geekotest psql openqa' and in there copy-paste one of the existing SQL commands from the telegraf config for a start, or use our metabase instance https://maintenance-statistics.dyn.cloud.suse.de (Okta credentials, e.g. username "okurz", not email)
  • Compare to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L108 . Maybe it is easiest to just come up with the corresponding query and replace "by host" with "by backend"
  • Run a calculation like number of "failed" jobs / number of all jobs, grouped by "machine", "backend" or "WORKER_CLASS", whichever is easiest (a minimal sketch follows this list)
  • Make a list of "inefficient" vs. "efficient" backends (or is everything but QEMU "inefficient"?). That's the hypothesis to prove
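
A minimal sketch of what such an ad-hoc query could look like after connecting with ssh osd 'sudo -u geekotest psql openqa'; the "result" and "machine" columns of the jobs table are assumed here, adjust as needed:

-- failed-job percentage per machine, as a rough first indicator
SELECT machine,
       COUNT(*) FILTER (WHERE result = 'failed') * 100.0 / COUNT(*) AS failed_ratio
FROM jobs
WHERE machine IS NOT NULL
GROUP BY machine
ORDER BY failed_ratio DESC;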
Actions #1

Updated by okurz about 2 years ago

  • Target version changed from future to Ready
Actions #2

Updated by okurz almost 2 years ago

  • Subject changed from [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" to [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by robert.richardson almost 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #4

Updated by robert.richardson almost 2 years ago

  • Status changed from In Progress to Feedback

Following suggestion #2, I was able to calculate the failure rate per specific machine.

query:
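-- percentage of jobs with result 'failed' per machine, over all jobs in the jobs table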

SELECT machine, ROUND(COUNT(*) FILTER (WHERE result='failed') * 100. / COUNT(*), 2)::numeric(5,2)::float
AS ratio_failed_by_machine
FROM jobs
WHERE machine IS NOT NULL
GROUP BY machine
ORDER BY ratio_failed_by_machine DESC

result:

             machine             | ratio_failed_by_machine 
---------------------------------+-------------------------
 ipmi-merckx                     |                     100
 Standard_D8s_v3                 |                     100
 i3en.large                      |                     100
 64bit-ipmi-amd                  |                     100
 Standard_L8s_v2                 |                     100
 chan-1                          |                     100
 ix64ph1014                      |                     100
 Standard_E2as_v3                |                     100
 cfconrad_machine                |                     100
 aarch64_maintenance             |                   92.86
 ec2_t3.2xlarge                  |                      92
 az_Standard_E8s_v3              |                   90.77
 ec2_r6i.xlarge                  |                   88.24
 qam-caasp_x86_64                |                   74.39
 s390x-kvm                       |                   73.91
 ec2_m4.2xlarge                  |                    72.5
 gce_n1_highmem_8                |                   71.58
 svirt-vmware70                  |                   69.29
 64bit-ipmi-amd-zen3             |                   63.48
 windows_uefi_boot               |                   60.98
 64bit-ipmi-large-mem-intel      |                   60.13
 svirt-hyperv2012r2-uefi         |                   57.76
 UEFI                            |                   57.14
 svirt-vmware65                  |                   55.79
 laptop_64bit-virtio-vga         |                   52.54
 svirt-hyperv-uefi               |                   52.14
 ppc64le-smp                     |                   51.85
 64bit-ipmi-large-mem            |                   50.72
 ppc64-smp                       |                      50
 ipmi-64bit-unarmed              |                      50
 uefi-2G                         |                      50
 64bit-ipmi-nvdimm               |                   49.45
 svirt-hyperv                    |                   48.45
 64bit-ipmi                      |                   48.44
 win11_uefi                      |                   47.93
 ipmi-64bit-mlx_con5             |                   46.43
 az_Standard_H8                  |                   45.83
 zkvm-image                      |                   45.55
 zkvm                            |                   44.55
 s390x-zfcp                      |                   44.31
 windows_bios_boot               |                   43.75
 svirt-hyperv2012r2              |                   43.28
 svirt-kgraft                    |                   42.86
 ec2_m5d.large                   |                   40.28
 az_Standard_DC2s                |                      40
 svirt-xen-pv                    |                   39.84
 ppc64le-hmc-sap                 |                   39.58
 uefi-sap                        |                    38.6
 svirt-xen-hvm                   |                   38.15
 virt-arm-64bit-ipmi-machine     |                   37.88
 ec2_m5.metal                    |                    37.5
 svirt-hyperv2016-uefi           |                   37.33
 64bit_cirrus                    |                   37.27
 aarch64_raid                    |                   36.73
 ppc64le-hmc                     |                   36.47
 virt-s390x-kvm-sle12sp5         |                   36.26
 64bit-virtio-vga                |                   34.03
 ipmi-coppi                      |                   33.73
 svirt-hyperv2016                |                   33.66
 s390x-kvm-sle15                 |                   33.58
 64bit-ipmi-sriov                |                   33.33
 uefi-virtio-vga                 |                   31.89
 ppc64le-hmc-single-disk         |                   31.85
 s390x-zVM-ctc                   |                      30
 ipmi-sonic                      |                      30
 s390x-zVM-Upgrade-sp2           |                    29.3
 64bit-smp                       |                   28.93
 s390x-zVM-Upgrade-m1            |                   28.67
 az_Standart_L8s_v2              |                   28.57
 virt-s390x-kvm-sle15sp5         |                    28.3
 64bit-2gbram-cirrus             |                   27.75
 ec2_r4.8xlarge                  |                   27.27
 s390x-kvm-sle12-mm              |                   27.11
 s390x-zVM-vswitch-l2            |                   26.98
 RPi3B+                          |                   26.67
 ppc64le-hmc-4disk               |                   26.32
 64bit-staging                   |                   26.19
 virt-s390x-kvm-sle15sp4         |                   25.51
 uefi-staging                    |                      25
 svirt-vmware                    |                      25
 virt-mm-64bit-ipmi              |                   24.53
 64bit-amd                       |                   24.33
 virt-pvusb-64bit-ipmi           |                   24.24
 ppc64le-spvm                    |                   23.64
 RPi3B                           |                   23.33
 s390x-zVM-vswitch-l3            |                   22.69
 ppc64le                         |                   22.62
 ppc64le-2g                      |                   22.12
 s390x-zVM-hsi-l3                |                   21.97
 RPi4                            |                   21.05
 ec2_t2.large                    |                   20.38
 s390x-zVM-hsi-l2                |                   20.25
 aarch64                         |                   19.77
 s390x-kvm-sle12                 |                   19.31
 ipmi-kernel-rt                  |                   19.12
 gce_n2d_standard_2_confidential |                   18.85
 ppc64le-sap-qam                 |                   18.83
 ppc64le-sap                     |                   18.48
 gce_n1_standard_2               |                   18.16
 uefi                            |                   16.92
 s390x-zVM-Upgrade-sp1           |                   16.91
 svirt-kvm-uefi                  |                   16.67
 svirt-kvm                       |                   16.67
 ec2_a1.large                    |                   16.39
 ppc64le-no-tmpfs                |                   15.28
 caasp_x86_64                    |                   15.05
 64bit-no-tmpfs                  |                   14.91
 ipmi-coppi-xen                  |                   14.29
 aarch64-virtio                  |                   14.26
 az_Standard_A2_v2               |                   14.08
 ipmi-tails                      |                   13.16
 az_Standard_B2s                 |                   13.01
 ec2_i3.metal                    |                    12.5
 64bit-sap                       |                   12.38
 64bit                           |                   11.96
 win10_uefi                      |                   11.94
 win10_64bit                     |                   11.44
 virt-s390x-kvm-sle15sp3         |                   10.26
 bmw-mpad3                       |                   10.18
 ec2_a1.medium                   |                   10.06
 64bit-2gbram                    |                   10.04
 ipmi-tyrion                     |                     9.3
 ec2_m4.large                    |                    8.33
 64bit-sap-qam                   |                    8.19
 64bit-qxl                       |                    8.18
 ec2_i3.8xlarge                  |                    8.01
 az_Standard_L8s_v2              |                    7.44
 ec2_t2.small                    |                    6.91
 aarch64-virtio-4gbram           |                    6.87
 virt-s390x-kvm-sle15sp2         |                    6.67
 ec2_c5.large                    |                    6.38
 64bit-4gbram                    |                    6.35
 64bit_win                       |                    6.12
 ec2_m5.large                    |                    5.14
 az_Standard_E2s_v4              |                    5.05
 ppc64le-virtio                  |                    4.98
 gce_n1_standard_1               |                    4.61
 az_Standard_D2s_v4              |                    4.55
 az_Standard_B1s                 |                    4.39
 ec2_i3.large                    |                    4.38
 ec2_t3.small                    |                    4.36
 gce_n1_highmem_2                |                    4.23
 az_Standard_F2s_v2              |                    4.23
 gce_n2d_standard_2              |                    3.97
 gce_n1_highcpu_2                |                    3.19
 az_Standard_DC2s_v2             |                    3.02
 gce_f1_micro                    |                    2.49
 ec2_t3.medium                   |                       0
 ec2_r3.8xlarge                  |                       0
 ec2_c4.large                    |                       0
 s390x-zVM-Upgrade               |                       0
 virt-s390x-kvm-sle15            |                       0
 ppc64                           |                       0
 virt-s390x-kvm-sle15sp1         |                       0
 ipmi-64bit-thunderx             |                       0
 virt-s390x-kvm-sle12sp4         |                       0
(156 rows)
Actions #5

Updated by okurz almost 2 years ago

Nice. I suggest creating a ratio "per backend" as the next step. Likely a little more complicated, as you would need to join the corresponding worker settings, but IMHO still manageable. It would also be helpful to know the total number of jobs that were used, to put the numbers in relation.

Actions #6

Updated by robert.richardson almost 2 years ago

OK, I think I have it.

query:
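-- percentage of 'failed' among finished jobs (result != 'none') per backend, plus total job count;
-- the machines table is joined to map each job's machine name to its backend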

WITH finished AS (SELECT result, backend FROM jobs LEFT JOIN machines ON jobs.machine = machines.name WHERE result != 'none')
SELECT backend, ROUND(COUNT(*) FILTER (WHERE result='failed') * 100. / COUNT(*), 2)::numeric(5,2)::float AS ratio_failed_by_backend, COUNT(*) AS job_count
FROM finished
WHERE backend IS NOT NULL
GROUP BY backend
ORDER BY ratio_failed_by_backend DESC

result:

  backend  | ratio_failed_by_backend | job_count 
-----------+-------------------------+------------------
 ipmi      |                   45.48 |             8305
 pvm_hmc   |                   32.72 |             9970
 s390x     |                   26.28 |             3961
 svirt     |                   25.85 |            70921
 spvm      |                   23.64 |             2978
 qemu      |                   13.79 |           529073
 generalhw |                   11.26 |             1297

This confirms all backends except generalhw to be inefficient compared to qemu. Should I mark this ticket as resolved?

Actions #7

Updated by okurz almost 2 years ago

Thank you. I guess that's all.

I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1671724262632709

@here Robert Richardson has collected nice statistics from openqa.suse.de to answer the question of stability of our various backends:
  backend  | ratio_failed_by_backend | job_count
-----------+-------------------------+-----------
 ipmi      |                   45.48 |      8305
 pvm_hmc   |                   32.72 |      9970
 s390x     |                   26.28 |      3961
 svirt     |                   25.85 |     70921
 spvm      |                   23.64 |      2978
 qemu      |                   13.79 |    529073
 generalhw |                   11.26 |      1297

so the result in general supports my prior expectations, which are:

  1. Roughly 90% of tests are running on qemu, so this continues to be by far our most important backend
  2. Jobs on non-qemu backends, in particular ipmi and pvm_hmc, are 3-4x more prone to fail than qemu jobs

What does that mean for you?

  1. If you can, run tests on qemu because this is the most stable and scalable platform (regardless of architecture) and this will stay the backend for which we will provide the best support
  2. If you still think you need ipmi or pvm_hmc or the like, then you (as in "testing related squads") should definitely dedicate resources to improve and extend those backends; everything else will be a horrible experience. One specific example: ppc64le on qemu is unfortunately not well supported upstream, or not supported anymore at all, so any migration of tests to pvm_hmc should only come with corresponding improvements in the backend, and I don't see many reasonable contributions planned by teams in this direction
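
For reference, putting the job counts above in relation (as asked in #5): 8305 + 9970 + 3961 + 70921 + 2978 + 529073 + 1297 = 626505 finished jobs in total, of which qemu accounts for 529073, i.e. about 84%.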
Actions #8

Updated by robert.richardson almost 2 years ago

  • Status changed from Feedback to Resolved