action #120540
Status: closed
[timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S
Description
Motivation
#73072 and #120154. Most of our tests are based on the qemu backend, which allows efficient use of hardware. In other cases, like the ppc hmc backend, we treat tests like "bare metal" tests, which is rather inefficient: e.g. it does not allow loading VM images, and AFAIK RAM cannot be configured from tests, which strongly limits the number of instances per hardware machine. To find our least efficient implementations we should query the database of tests to compute a "tests per hardware machine ratio".
Suggestions
- Run SQL queries against OSD's production database to get the data, either manually, e.g.
  ssh osd 'sudo -u geekotest psql openqa'
  and in there copy-paste one of the existing SQL commands from the telegraf config for a start, or use our metabase instance https://maintenance-statistics.dyn.cloud.suse.de (okta credentials, e.g. username "okurz", not email)
- Compare to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-webui.conf#L108 . Maybe easiest to just come up with the according query and replace "by host" with "by backend"
- Run a calculation like
  number of "failed" / number of all jobs
  grouped by "machine" or "backend" or "WORKER_CLASS", whatever is easiest (for the WORKER_CLASS variant see the sketch after this list)
- Make a list of "inefficient" vs. "efficient" backends (or is everything but QEMU "inefficient"?). That's the hypothesis to prove
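For the WORKER_CLASS grouping a join against the per-job settings would be needed. A rough, untested sketch, assuming the job_settings key/value table from the openQA schema:

-- Hypothetical: failure ratio grouped by WORKER_CLASS, looked up from the
-- job_settings table (one key/value row per job setting)
SELECT job_settings.value AS worker_class,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2) AS ratio_failed,
       COUNT(*) AS job_count
FROM jobs
JOIN job_settings ON job_settings.job_id = jobs.id AND job_settings.key = 'WORKER_CLASS'
GROUP BY job_settings.value
ORDER BY ratio_failed DESC;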
Updated by okurz almost 2 years ago
- Subject changed from [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" to [timeboxed:10h][research] Find inefficient test implementations and backend use by a "tests per hardware machine ratio" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by robert.richardson almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by robert.richardson almost 2 years ago
- Status changed from In Progress to Feedback
Following suggestion #2 I was able to calculate the failure rate per machine.
query:
-- failure ratio per machine over all jobs in the database
SELECT machine,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2)::numeric(5,2)::float
       AS ratio_failed_by_machine
FROM jobs
WHERE machine IS NOT NULL
GROUP BY machine
ORDER BY ratio_failed_by_machine DESC;
result:
machine | ratio_failed_by_machine
---------------------------------+-------------------------
ipmi-merckx | 100
Standard_D8s_v3 | 100
i3en.large | 100
64bit-ipmi-amd | 100
Standard_L8s_v2 | 100
chan-1 | 100
ix64ph1014 | 100
Standard_E2as_v3 | 100
cfconrad_machine | 100
aarch64_maintenance | 92.86
ec2_t3.2xlarge | 92
az_Standard_E8s_v3 | 90.77
ec2_r6i.xlarge | 88.24
qam-caasp_x86_64 | 74.39
s390x-kvm | 73.91
ec2_m4.2xlarge | 72.5
gce_n1_highmem_8 | 71.58
svirt-vmware70 | 69.29
64bit-ipmi-amd-zen3 | 63.48
windows_uefi_boot | 60.98
64bit-ipmi-large-mem-intel | 60.13
svirt-hyperv2012r2-uefi | 57.76
UEFI | 57.14
svirt-vmware65 | 55.79
laptop_64bit-virtio-vga | 52.54
svirt-hyperv-uefi | 52.14
ppc64le-smp | 51.85
64bit-ipmi-large-mem | 50.72
ppc64-smp | 50
ipmi-64bit-unarmed | 50
uefi-2G | 50
64bit-ipmi-nvdimm | 49.45
svirt-hyperv | 48.45
64bit-ipmi | 48.44
win11_uefi | 47.93
ipmi-64bit-mlx_con5 | 46.43
az_Standard_H8 | 45.83
zkvm-image | 45.55
zkvm | 44.55
s390x-zfcp | 44.31
windows_bios_boot | 43.75
svirt-hyperv2012r2 | 43.28
svirt-kgraft | 42.86
ec2_m5d.large | 40.28
az_Standard_DC2s | 40
svirt-xen-pv | 39.84
ppc64le-hmc-sap | 39.58
uefi-sap | 38.6
svirt-xen-hvm | 38.15
virt-arm-64bit-ipmi-machine | 37.88
ec2_m5.metal | 37.5
svirt-hyperv2016-uefi | 37.33
64bit_cirrus | 37.27
aarch64_raid | 36.73
ppc64le-hmc | 36.47
virt-s390x-kvm-sle12sp5 | 36.26
64bit-virtio-vga | 34.03
ipmi-coppi | 33.73
svirt-hyperv2016 | 33.66
s390x-kvm-sle15 | 33.58
64bit-ipmi-sriov | 33.33
uefi-virtio-vga | 31.89
ppc64le-hmc-single-disk | 31.85
s390x-zVM-ctc | 30
ipmi-sonic | 30
s390x-zVM-Upgrade-sp2 | 29.3
64bit-smp | 28.93
s390x-zVM-Upgrade-m1 | 28.67
az_Standart_L8s_v2 | 28.57
virt-s390x-kvm-sle15sp5 | 28.3
64bit-2gbram-cirrus | 27.75
ec2_r4.8xlarge | 27.27
s390x-kvm-sle12-mm | 27.11
s390x-zVM-vswitch-l2 | 26.98
RPi3B+ | 26.67
ppc64le-hmc-4disk | 26.32
64bit-staging | 26.19
virt-s390x-kvm-sle15sp4 | 25.51
uefi-staging | 25
svirt-vmware | 25
virt-mm-64bit-ipmi | 24.53
64bit-amd | 24.33
virt-pvusb-64bit-ipmi | 24.24
ppc64le-spvm | 23.64
RPi3B | 23.33
s390x-zVM-vswitch-l3 | 22.69
ppc64le | 22.62
ppc64le-2g | 22.12
s390x-zVM-hsi-l3 | 21.97
RPi4 | 21.05
ec2_t2.large | 20.38
s390x-zVM-hsi-l2 | 20.25
aarch64 | 19.77
s390x-kvm-sle12 | 19.31
ipmi-kernel-rt | 19.12
gce_n2d_standard_2_confidential | 18.85
ppc64le-sap-qam | 18.83
ppc64le-sap | 18.48
gce_n1_standard_2 | 18.16
uefi | 16.92
s390x-zVM-Upgrade-sp1 | 16.91
svirt-kvm-uefi | 16.67
svirt-kvm | 16.67
ec2_a1.large | 16.39
ppc64le-no-tmpfs | 15.28
caasp_x86_64 | 15.05
64bit-no-tmpfs | 14.91
ipmi-coppi-xen | 14.29
aarch64-virtio | 14.26
az_Standard_A2_v2 | 14.08
ipmi-tails | 13.16
az_Standard_B2s | 13.01
ec2_i3.metal | 12.5
64bit-sap | 12.38
64bit | 11.96
win10_uefi | 11.94
win10_64bit | 11.44
virt-s390x-kvm-sle15sp3 | 10.26
bmw-mpad3 | 10.18
ec2_a1.medium | 10.06
64bit-2gbram | 10.04
ipmi-tyrion | 9.3
ec2_m4.large | 8.33
64bit-sap-qam | 8.19
64bit-qxl | 8.18
ec2_i3.8xlarge | 8.01
az_Standard_L8s_v2 | 7.44
ec2_t2.small | 6.91
aarch64-virtio-4gbram | 6.87
virt-s390x-kvm-sle15sp2 | 6.67
ec2_c5.large | 6.38
64bit-4gbram | 6.35
64bit_win | 6.12
ec2_m5.large | 5.14
az_Standard_E2s_v4 | 5.05
ppc64le-virtio | 4.98
gce_n1_standard_1 | 4.61
az_Standard_D2s_v4 | 4.55
az_Standard_B1s | 4.39
ec2_i3.large | 4.38
ec2_t3.small | 4.36
gce_n1_highmem_2 | 4.23
az_Standard_F2s_v2 | 4.23
gce_n2d_standard_2 | 3.97
gce_n1_highcpu_2 | 3.19
az_Standard_DC2s_v2 | 3.02
gce_f1_micro | 2.49
ec2_t3.medium | 0
ec2_r3.8xlarge | 0
ec2_c4.large | 0
s390x-zVM-Upgrade | 0
virt-s390x-kvm-sle15 | 0
ppc64 | 0
virt-s390x-kvm-sle15sp1 | 0
ipmi-64bit-thunderx | 0
virt-s390x-kvm-sle12sp4 | 0
(156 rows)
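Note that these are all-time numbers; if historic jobs skew the picture, the query could be restricted to recent jobs. A sketch, assuming the jobs table's t_finished timestamp column:

-- Hypothetical variant: same ratio, but only jobs finished in the last 90 days
SELECT machine,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2) AS ratio_failed_by_machine
FROM jobs
WHERE machine IS NOT NULL
  AND t_finished > now() - interval '90 days'
GROUP BY machine
ORDER BY ratio_failed_by_machine DESC;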
Updated by okurz almost 2 years ago
Nice. I suggest as the next step to create a ratio "per backend". Likely a little more complicated as you would need to join the according machine settings, but IMHO still manageable. It would also be helpful to know the total number of jobs per group, to put the ratios in relation.
Updated by robert.richardson almost 2 years ago
OK, I think I have it.
query:
-- failure ratio per backend: join the machine definitions to map each job's
-- machine to its backend, counting only finished jobs (result != 'none')
WITH finished AS (
    SELECT result, backend
    FROM jobs
    LEFT JOIN machines ON jobs.machine = machines.name
    WHERE result != 'none'
)
SELECT backend,
       ROUND(COUNT(*) FILTER (WHERE result = 'failed') * 100. / COUNT(*), 2)::numeric(5,2)::float AS ratio_failed_by_backend,
       COUNT(*) AS job_count
FROM finished
WHERE backend IS NOT NULL
GROUP BY backend
ORDER BY ratio_failed_by_backend DESC;
result:
backend | ratio_failed_by_backend | job_count
-----------+-------------------------+------------------
ipmi | 45.48 | 8305
pvm_hmc | 32.72 | 9970
s390x | 26.28 | 3961
svirt | 25.85 | 70921
spvm | 23.64 | 2978
qemu | 13.79 | 529073
generalhw | 11.26 | 1297
This confirms that all backends except generalhw are inefficient compared to qemu. Should I mark this ticket as resolved?
Updated by okurz almost 2 years ago
Thank you. I guess that's all.
I wrote a message in https://suse.slack.com/archives/C02CANHLANP/p1671724262632709
@here Robert Richardson has collected nice statistics from openqa.suse.de to answer the question of stability of our various backends:
backend | ratio_failed_by_backend | job_count
-----------+-------------------------+------------------
ipmi | 45.48 | 8305
pvm_hmc | 32.72 | 9970
s390x | 26.28 | 3961
svirt | 25.85 | 70921
spvm | 23.64 | 2978
qemu | 13.79 | 529073
generalhw | 11.26 | 1297
So the result in general supports my expectations from before, which are:
- Roughly 90% of tests are running on qemu, so this continues to be by far our most important backend
- non-qemu backends, in particular ipmi and pvm_hmc, are 3-4x more prone to fail than qemu jobs
What does that mean for you?
- If you can, run tests on qemu because this is the most stable and scalable platform (regardless of architecture) and this will stay the backend for which we will provide the best support
- If you still think you need ipmi or pvm_hmc or the like, then you (as in "testing related squads") should definitely dedicate resources to improving and extending those backends; everything else will be a horrible experience. One specific example: ppc64le on qemu is unfortunately not well supported upstream, or not supported anymore at all, so any migration of tests to pvm_hmc should only come with according improvements in the backend, and I don't see many reasonable contributions planned by teams in this direction
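(For reference, the qemu share can be derived from the job_count column above: 529073 of 626505 finished jobs with a known backend, i.e. about 84%. A sketch computing the share directly, reusing the machines join from the query above:)

-- share of all finished jobs per backend, via a window function over the
-- grouped counts
SELECT backend,
       COUNT(*) AS job_count,
       ROUND(COUNT(*) * 100. / SUM(COUNT(*)) OVER (), 2) AS share_of_all_jobs
FROM jobs
LEFT JOIN machines ON jobs.machine = machines.name
WHERE backend IS NOT NULL AND result != 'none'
GROUP BY backend
ORDER BY job_count DESC;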
Updated by robert.richardson almost 2 years ago
- Status changed from Feedback to Resolved