coordination #101048
open[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
82%
Description
Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%
Acceptance criteria
- AC1: openqaworker-arm-4/5 have a fail-ratio less than or equal to that of arm-1/2/3
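A minimal sketch of how AC1 could be checked once per-worker job counts are exported from the dashboard. The function names are hypothetical, and the reading of AC1 used here (every arm-4/5 ratio must be no worse than the worst arm-1/2/3 ratio) is an assumption. The sample numbers are just the range endpoints from the observation, not exact per-host values.

```python
def fail_ratio(failed: int, total: int) -> float:
    """Return the fraction of failed jobs; raises if no jobs were run."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def meets_ac1(new_ratios: dict[str, float], old_ratios: dict[str, float]) -> bool:
    """Assumed reading of AC1: no new worker is worse than the worst old worker."""
    return max(new_ratios.values()) <= max(old_ratios.values())


# Range endpoints taken from the observation (15-17% vs. 33-36%).
old = {"arm-1/2/3 low": 0.15, "arm-1/2/3 high": 0.17}
new = {"arm-4/5 low": 0.33, "arm-4/5 high": 0.36}

print(meets_ac1(new, old))  # prints False
```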
Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5
- The CPU (the specific model and version, Cavium ThunderX2) of arm-4/5 is known to behave badly for our use-case and that's the difference to the older arm workers (which have the previous version of that CPU model installed).
- Disabling cpu control and cpu frequency scaling in the firmware environment didn't make a difference.
- Before that we've already tried to reduce the number of worker slots a lot and it didn't help either.
- There are still a few ideas to consider (see #109232#note-5).
- There are also more variables in the firmware environment (see #109232#note-20) we can play with.
- Next time we should buy different hardware (see private comment #109232#note-11).
- See the full ticket #109232 for more context about these findings.
Suggestions
- Confirm if typing issues cause the failures (look for timeouts, observe additional or missing characters in typed commands)
- Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
- Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the openSUSE:Leap:15.3 one
- Consider downgrading kernel to what's used in 15.2 -> same upstream version is running on most
- Bring back arm 4 and 5 after verifying stability
- Run typing.pm from os-autoinst as test in production -> #101262
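For the first suggestion (additional or missing characters in typed commands), a rough sketch of how an expected command could be compared against what actually showed up in the console. This is not how os-autoinst or typing.pm does it. `classify_typing_errors` is a hypothetical helper built on Python's `difflib`, and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def classify_typing_errors(expected: str, observed: str) -> tuple[str, str]:
    """Return (missing, extra) characters of `observed` relative to `expected`."""
    missing, extra = [], []
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete"/"replace": characters from the expected text did not arrive.
        if tag in ("delete", "replace"):
            missing.append(expected[i1:i2])
        # "insert"/"replace": characters arrived that were never typed.
        if tag in ("insert", "replace"):
            extra.append(observed[j1:j2])
    return "".join(missing), "".join(extra)


print(classify_typing_errors("echo hello", "echo helllo"))  # ('', 'l')
print(classify_typing_errors("zypper", "zyper"))            # ('p', '')
```

A repeated character points to a key-repeat style problem and a dropped character to lost key events, so counting both separately over many failed jobs might show which one dominates on arm-4/5.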
Updated by okurz over 3 years ago
- Related to action #101030: Typing problems on aarch64 added
Updated by nicksinger over 3 years ago
- File sysctl_diff.html added
A first investigation shows that we run Leap 15.2 on the "old" workers and Leap 15.3 on the new ones. The kernel version differs quite a bit between workers:
arm-1: 5.8.3-1.gbad027a-default
arm-2: 5.7.12-1.g9c98feb-default
arm-3: 5.3.18-lp152.95-default
arm-4: 5.3.18-59.27-default
arm-5: 5.3.18-59.27-default
Given that arm3 has at least a similar kernel version, I'd exclude the kernel as a cause for now.
Kernel cmdline is managed by salt and therefore the same on all 5 machines.
I also tried to diff the sysctl settings currently active on the systems. Due to the different kernels this is quite a tedious task, and I didn't see much that could make a difference here. Attaching the diff as HTML.
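A sketch of how such a sysctl comparison could be scripted instead of done by eye, assuming the output of `sysctl -a` was saved from two hosts as plain "key = value" lines. The helper names are hypothetical and the sample values below are invented, not taken from the attached diff.

```python
def parse_sysctl(text: str) -> dict[str, str]:
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a "=" separator
            settings[key.strip()] = value.strip()
    return settings


def diff_sysctl(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return keys whose values differ; None marks a key missing on one side."""
    changed = {}
    for key in sorted(a.keys() | b.keys()):
        left, right = a.get(key), b.get(key)
        if left != right:
            changed[key] = (left, right)
    return changed


# Invented sample data for illustration only.
arm3 = parse_sysctl("vm.swappiness = 60\nkernel.sched_migration_cost_ns = 500000\n")
arm4 = parse_sysctl("vm.swappiness = 10\nvm.dirty_ratio = 20\n")

for key, (left, right) in diff_sysctl(arm3, arm4).items():
    print(f"{key}: arm3={left} arm4={right}")
```

Keys that exist on only one side are mostly a consequence of the different kernel versions, so filtering for keys present on both hosts first would cut down the noise.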
Updated by nicksinger over 3 years ago
Despite being far less utilized, arm4 sees load spikes up to 75 and sits fairly constantly around 25 according to: https://monitor.qa.suse.de/d/WDopenqaworker-arm-4/worker-dashboard-openqaworker-arm-4?viewPanel=54694&orgId=1&from=1633952613651&to=1634557413651 - this could hint at IO performance issues.
Updated by nicksinger over 3 years ago
Disk IO on arm4 is 10x lower than on arm3: https://monitor.qa.suse.de/d/WDopenqaworker-arm-4/worker-dashboard-openqaworker-arm-4?viewPanel=13782&orgId=1&from=1633953007558&to=1634557807558 (expected, less load) but IO response times seem to be ~50% worse according to https://monitor.qa.suse.de/d/WDopenqaworker-arm-4/worker-dashboard-openqaworker-arm-4?viewPanel=56720&orgId=1&from=1633953007558&to=1634557807558
Updated by nicksinger over 3 years ago
Network IO seems fine. Higher packet drops can be observed on arm4 & arm5 (exactly the same pattern, so hinting at the switch) but IMHO this shouldn't cause such a performance hit.
Updated by livdywan over 3 years ago
- Tracker changed from action to coordination
- Subject changed from Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
- Description updated (diff)
Updated by livdywan over 3 years ago
- Copied to action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M added
Updated by okurz over 3 years ago
- Status changed from New to Blocked
- Assignee set to okurz
blocked by subtasks
Updated by okurz about 3 years ago
- Subject changed from [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Updated by kraih about 3 years ago
There are still a lot of failed jobs from the #101271 stress test that should be searched for patterns. Maybe that will give some hints about where to look in follow-up investigations.
Updated by mkittler about 3 years ago
The workers arm-4/5 went offline on 05.12.2021. IPMI still responds, so I invoked a power cycle. However, neither worker booted successfully. Both got stuck early in the boot:
Loading Linux 5.15.5-lp153.2.g83fc974-default ...
Loading initial ramdisk ...
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
EFI stub: ERROR: FIRMWARE BUG: Image BSS overlaps adjacent EFI memory region
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
INFO: Node: 0 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
INFO: Node: 1 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
[ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x431f0af2]
[ 0.000000][ T0] Linux version 5.15.5-lp153.2.g83fc974-default (geeko@buildhost) (gcc (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9], GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26) #1 SMP Thu Nov 25 09:36:40 UTC 2021 (83fc974)
[ 0.000000][ T0] efi: EFI v2.70 by American Megatrends
[ 0.000000][ T0] efi: ESRT=0xf9515018 SMBIOS=0xfe390000 SMBIOS 3.0=0xfe380000 ACPI 2.0=0xfd8d0000 MOKvar=0xf7bd7000 MEMRESERVE=0xf4801798
[ 0.000000][ T0] esrt: Reserving ESRT space from 0x00000000f9515018 to 0x00000000f9515050.
[ 0.000000][ T0] ACPI: Early table checksum verification disabled
…
[ 39.152724][ T1] pci_bus 0000:80: resource 4 [mem 0x60000000-0x7fffffff window]
[ 39.160282][ T1] pci_bus 0000:80: resource 5 [mem 0x14000000000-0x17fffffffff window]
[ 39.168366][ T1] pci_bus 0000:91: resource 1 [mem 0x60000000-0x600fffff]
[ 39.214998][ T1] iommu: Default domain type: Passthrough
[ 39.220723][ T1] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 39.229763][ T1] pci 0000:0d:00.0: vgaarb: bridge control possible
[ 39.236201][ T1] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[ 39.245755][ T1] vgaarb: loaded
[ 39.249503][ T1] SCSI subsystem initialized
[ 39.254128][ T1] pps_core: LinuxPPS API ver. 1 registered
[ 39.259785][ T1] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[ 39.269652][ T1] PTP clock support registered
[ 39.274282][ T1] EDAC MC: Ver: 3.0.0
[ 39.278314][ T1] Registered efivars operations
[ 39.287410][ T1] NetLabel: Initializing
[ 39.291497][ T1] NetLabel: domain hash size = 128
[ 39.296541][ T1] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
[ 39.302898][ T1] NetLabel: unlabeled traffic allowed by default
<no further log messages>
I removed both workers from salt and paused the host-up alerts.
Updated by kraih about 3 years ago
I also gave power cycling arm-4 a try and for me it ended at a slightly different point:
...
[ 27.159822][ T1] pci_bus 0000:0e: resource 1 [mem 0x43100000-0x432fffff]
[ 27.196956][ T1] ARMH0011:00: ttyAMA0 at MMIO 0x402020000 root bus resource [mem 0x60000000-0x7fffffff window]
[ 38.675918][ T1] pci_bus 0000:80: root bus resource [mem 0x14000000000-0x17fffffffff window]
[ 38.684606][ T1] pci_bus 0000:80: root bus resource [bus 80-ff]
[ 38.690813][ T1] pci 0000:80:00.0: [177d:af00] type 00 class 0x060000
[ 38.697639][ T1] pci 0000:80:01.0: [177d:af84] type 01 class 0x060400
[ 38.704369][ T1] pci 0000:80:01.0: PME# supported from D0 D3hot D3cold
[ 38.711273][ T1] pci 0000:80:02.0: [177d:af84] type 01 class 0x060400
[ 38.718000][ T1] pci 0000:80:02.0: PME# supported from D0 D3hot D3cold
[ 38.724911][ T1] pci 0000:80:03.0: [177d:af84] type 01 class 0x060400
[ 38.731640][ T1] pci 0000:80:03.0: PME# supported from D0 D3hot D3cold
[ 38.738539][ T1] pci 0000:80:04.0: [177d:af84] type 01 class 0x060400
[ 38.745267][ T1] pci 0000:80:04.0: PME# supported from D0 D3hot D3cold
[ 38.752161][ T1] pci 0000:80:05.0: [177d:af84] type 01 class 0x060400
[ 38.758892][ T1] pcrom D0 D3hot D3cold
[ 38.888346][ T1] pci 0000:80:0f.0: [14e4:9026] type 00 class 0x0c0330
[ 38.895048][ T1] pci 0000:80:0f.0: reg 0x10: [mem 0x14000030000-0x1400003ffff 64bit pref]
[ 38.903479][ T1] pci 0000:80:0f.0: reg 0x18: [mem 0x14000020000-0x1400002ffff 64bit pref]
[ 38.911999][ T1] pci 0000:80:0f.1: [14e4:9026] type 00 class 0x0c0330
[ 38.918695][ T1] pci 0000:80:0f.1: reg 0x10: [mem 0x14000010000-0x1400001ffff 64bit pref]
[ 38.927129][ T1] pci 0000:80:0f.1: reg 0x18: [mem 0x14000000000-0x1400000ffff 64bit pref]
[ 38.935693][ T1] acpiphp: Slot [1] registered
[ 38.940380][ T1] acpiphp: Slot [1-1] registered
[ 38.945240][ T1] acpiphp: Slot [1-2] registered
[ 38.950096][ T1] acpiphp: Slot [1-3] registered
[ 38.954941][ T1] pci 0000:91:00.0: [8086:0a54] type 00 class 0x010802
[ 38.961645][ T1] pci 0000:91:00.0: reg 0x10: [mem 0x60000000-0x60003fff 64bit]
[ 38.969148][ T1] pci 0000:91:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pr
<no further log messages>
The machine does boot with the 5.14.14 kernel though. Upgrading to 5.15.10 did not work; it gets stuck at the same point during boot.
Updated by kraih about 3 years ago
Downgraded both machines to the default Leap 15.3 kernel, so they are working again.
Updated by okurz about 3 years ago
- Status changed from Workable to Blocked
- Assignee set to okurz
Updated by okurz about 3 years ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
All current subtasks are resolved. We need to brainstorm again together about what to do.
Updated by mkittler almost 3 years ago
Yes. The outcome of #104304 is that os-autoinst's fullstack test is not sufficient to find any difference between arm-1/2/3 and 4/5.
Updated by okurz almost 3 years ago
The last time we spoke about this we came up with the idea of involving ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory ) and he offered help but needs more details about the machines. So I suggest getting the details, e.g. log in, call dmesg and dmidecode, provide those details in the ticket and ask ggardet_arm again. Maybe something about hugepages, cpu flags, some boot kernel parameters to work around I/O quirks, anything like that. We created a new specific suggestion in a subtask.
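A rough sketch of how the requested machine details could be gathered into one text report per host for attaching to the ticket. The helper names and the report layout are hypothetical. Only the plain `dmesg` and `dmidecode` invocations are assumed, and both typically need root.

```python
import subprocess

# Commands named in the comment above, invoked without extra options.
COMMANDS = {
    "dmesg": ["dmesg"],
    "dmidecode": ["dmidecode"],
}


def run_command(argv: list[str]) -> str:
    """Run a command and return its stdout, or a short marker on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return f"<{argv[0]} not found>"
    if result.returncode != 0:
        return f"<{argv[0]} failed: {result.stderr.strip()}>"
    return result.stdout


def build_report(host: str, outputs: dict[str, str]) -> str:
    """Join the per-command outputs into one plain-text report."""
    sections = [f"Host: {host}"]
    for name, text in outputs.items():
        sections.append(f"--- {name} ---\n{text.rstrip()}")
    return "\n\n".join(sections) + "\n"
```

On a worker this could be used as `build_report("openqaworker-arm-4", {name: run_command(argv) for name, argv in COMMANDS.items()})`, with the result saved to a file and attached here.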
Updated by szarate over 2 years ago
Conversation https://suse.slack.com/archives/C02CANHLANP/p1656568851927729 is also related
Updated by okurz over 1 year ago
- Status changed from Blocked to New
- Assignee deleted (mkittler)
- Target version changed from Ready to future
- Parent task changed from #109743 to #121732