coordination #101048
open
[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Added by okurz over 3 years ago.
Updated over 1 year ago.
Category:
Regressions/Crashes
Estimated time:
(Total: 0.00 h)
Description
Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%
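For illustration, the fail ratio shown in that dashboard is just failed/(total finished) per worker. A minimal sketch of that arithmetic, using a made-up "worker result" input format (not the actual openQA data model):

```shell
# Hypothetical input: one line per finished job, "worker result".
# Computes the per-worker fail ratio the Grafana table aggregates.
awk '{ n[$1]++; if ($2 == "failed") f[$1]++ }
     END { for (w in n) printf "%s: %.0f%% failed\n", w, 100 * f[w] / n[w] }' <<'EOF'
arm-1 passed
arm-1 failed
arm-4 failed
arm-4 failed
arm-4 passed
EOF
```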
Acceptance criteria
- AC1: openqaworker-arm-4/5 have a fail-ratio less or equal to arm-1/2/3
Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5
- The CPU of arm-4/5 (specifically the Cavium ThunderX2) is known to behave badly for our use case; that's the difference to the older arm workers, which have the previous version of that CPU model installed.
- Disabling CPU control and CPU frequency scaling in the firmware environment didn't make a difference.
- Before that we had already reduced the number of worker slots a lot, which didn't help either.
- There are still a few ideas to consider (see #109232#note-5).
- There are also more variables in the firmware environment (see #109232#note-20) we can play with.
- Next time we should buy different hardware (see private comment #109232#note-11).
- See the full ticket #109232 for more context about these findings.
Suggestions
- Confirm whether typing issues cause the failures (look for timeouts; watch for additional or missing characters in typed commands)
- Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
- Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the openSUSE:Leap:15.3 one
- Consider downgrading the kernel to the one used in 15.2 -> the same upstream version is already running on most
- Bring back arm 4 and 5 after verifying stability
- Run typing.pm from os-autoinst as test in production -> #101262
First investigation shows that we run Leap 15.2 on the "old" workers and Leap 15.3 on the new ones. The kernel versions differ quite a bit between workers:
arm-1: 5.8.3-1.gbad027a-default
arm-2: 5.7.12-1.g9c98feb-default
arm-3: 5.3.18-lp152.95-default
arm-4: 5.3.18-59.27-default
arm-5: 5.3.18-59.27-default
Given that arm-3 has at least a similar kernel version, I'd exclude the kernel for now.
Kernel cmdline is managed by salt and therefore the same on all 5 machines.
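A quick spot check that kernel version and cmdline really match can look like this (the salt target pattern is an assumption, not the actual production target):

```shell
# Per host: print the running kernel and its command line.
uname -r
cat /proc/cmdline
# Across all five workers at once via salt (hypothetical target pattern):
# salt 'openqaworker-arm-*' cmd.run 'uname -r; cat /proc/cmdline'
```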
I also tried diffing the sysctls currently set on the systems. Due to the different kernels this is quite a tedious task, and I didn't see much that could make a difference here. Attaching the diff as HTML.
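A sketch of how such a comparison can be produced (filenames are hypothetical). This dumps all readable sysctls into one sorted file per host, equivalent to `sysctl -a | sort` but without depending on the sysctl binary:

```shell
# Dump every readable key under /proc/sys as "key = value", sorted.
for f in $(find /proc/sys -type f 2>/dev/null); do
  printf '%s = %s\n' "${f#/proc/sys/}" "$(tr '\n' ' ' 2>/dev/null < "$f")"
done | sort > "sysctl-$(uname -n).txt"
# After collecting one dump per worker, diff any two of them, e.g.:
#   diff sysctl-openqaworker-arm-3.txt sysctl-openqaworker-arm-4.txt
```

Keys that legitimately differ per host (hostname, random seeds) still need to be filtered out by hand, which is part of why the exercise is tedious.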
Network IO seems fine. Higher packet drops can be observed on arm-4 and arm-5 (exactly the same pattern, so hinting at the switch), but IMHO this shouldn't cause such a performance hit.
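The drop counters can be read straight from /proc/net/dev on each worker; a minimal sketch (column 5 of each interface line is the receive "drop" counter):

```shell
# Print per-interface RX drop counters from /proc/net/dev.
# sed splits the "iface:" prefix so awk field numbering is stable.
sed 's/:/ /' /proc/net/dev | awk 'NR > 2 { printf "%s rx_drop=%s\n", $1, $5 }'
```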
- Priority changed from High to Urgent
- Description updated (diff)
- Tracker changed from action to coordination
- Subject changed from Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
- Description updated (diff)
- Copied to action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M added
- Description updated (diff)
- Description updated (diff)
- Status changed from New to Blocked
- Assignee set to okurz
- Description updated (diff)
- Description updated (diff)
- Status changed from Blocked to New
- Assignee deleted (okurz)
I updated the epic with the results from #101265 and #101271. We can now continue defining more hypotheses to follow up on.
- Subject changed from [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
There are still a lot of failed jobs from the #101271 stress test that should be searched for patterns. Maybe that will give some hints on where to look in follow-up investigations.
- Status changed from New to Workable
The workers arm-4/5 went offline on 2021-12-05. IPMI still responds, so I invoked a power cycle. However, both workers didn't boot successfully; they both got stuck in early boot:
Loading Linux 5.15.5-lp153.2.g83fc974-default ...
Loading initial ramdisk ...
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
EFI stub: ERROR: FIRMWARE BUG: Image BSS overlaps adjacent EFI memory region
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
INFO: Node: 0 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
INFO: Node: 1 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
[ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x431f0af2]
[ 0.000000][ T0] Linux version 5.15.5-lp153.2.g83fc974-default (geeko@buildhost) (gcc (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9], GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26) #1 SMP Thu Nov 25 09:36:40 UTC 2021 (83fc974)
[ 0.000000][ T0] efi: EFI v2.70 by American Megatrends
[ 0.000000][ T0] efi: ESRT=0xf9515018 SMBIOS=0xfe390000 SMBIOS 3.0=0xfe380000 ACPI 2.0=0xfd8d0000 MOKvar=0xf7bd7000 MEMRESERVE=0xf4801798
[ 0.000000][ T0] esrt: Reserving ESRT space from 0x00000000f9515018 to 0x00000000f9515050.
[ 0.000000][ T0] ACPI: Early table checksum verification disabled
…
[ 39.152724][ T1] pci_bus 0000:80: resource 4 [mem 0x60000000-0x7fffffff window]
[ 39.160282][ T1] pci_bus 0000:80: resource 5 [mem 0x14000000000-0x17fffffffff window]
[ 39.168366][ T1] pci_bus 0000:91: resource 1 [mem 0x60000000-0x600fffff]
[ 39.214998][ T1] iommu: Default domain type: Passthrough
[ 39.220723][ T1] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 39.229763][ T1] pci 0000:0d:00.0: vgaarb: bridge control possible
[ 39.236201][ T1] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[ 39.245755][ T1] vgaarb: loaded
[ 39.249503][ T1] SCSI subsystem initialized
[ 39.254128][ T1] pps_core: LinuxPPS API ver. 1 registered
[ 39.259785][ T1] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[ 39.269652][ T1] PTP clock support registered
[ 39.274282][ T1] EDAC MC: Ver: 3.0.0
[ 39.278314][ T1] Registered efivars operations
[ 39.287410][ T1] NetLabel: Initializing
[ 39.291497][ T1] NetLabel: domain hash size = 128
[ 39.296541][ T1] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
[ 39.302898][ T1] NetLabel: unlabeled traffic allowed by default
<no further log messages>
I removed both workers from salt and paused the host-up alerts.
I also gave power cycling arm-4 a try, and for me it ended at a slightly different point:
...
[ 27.159822][ T1] pci_bus 0000:0e: resource 1 [mem 0x43100000-0x432fffff]
[ 27.196956][ T1] ARMH0011:00: ttyAMA0 at MMIO 0x402020000 root bus resource [mem 0x60000000-0x7fffffff window]
[ 38.675918][ T1] pci_bus 0000:80: root bus resource [mem 0x14000000000-0x17fffffffff window]
[ 38.684606][ T1] pci_bus 0000:80: root bus resource [bus 80-ff]
[ 38.690813][ T1] pci 0000:80:00.0: [177d:af00] type 00 class 0x060000
[ 38.697639][ T1] pci 0000:80:01.0: [177d:af84] type 01 class 0x060400
[ 38.704369][ T1] pci 0000:80:01.0: PME# supported from D0 D3hot D3cold
[ 38.711273][ T1] pci 0000:80:02.0: [177d:af84] type 01 class 0x060400
[ 38.718000][ T1] pci 0000:80:02.0: PME# supported from D0 D3hot D3cold
[ 38.724911][ T1] pci 0000:80:03.0: [177d:af84] type 01 class 0x060400
[ 38.731640][ T1] pci 0000:80:03.0: PME# supported from D0 D3hot D3cold
[ 38.738539][ T1] pci 0000:80:04.0: [177d:af84] type 01 class 0x060400
[ 38.745267][ T1] pci 0000:80:04.0: PME# supported from D0 D3hot D3cold
[ 38.752161][ T1] pci 0000:80:05.0: [177d:af84] type 01 class 0x060400
[ 38.758892][ T1] pcrom D0 D3hot D3cold
[ 38.888346][ T1] pci 0000:80:0f.0: [14e4:9026] type 00 class 0x0c0330
[ 38.895048][ T1] pci 0000:80:0f.0: reg 0x10: [mem 0x14000030000-0x1400003ffff 64bit pref]
[ 38.903479][ T1] pci 0000:80:0f.0: reg 0x18: [mem 0x14000020000-0x1400002ffff 64bit pref]
[ 38.911999][ T1] pci 0000:80:0f.1: [14e4:9026] type 00 class 0x0c0330
[ 38.918695][ T1] pci 0000:80:0f.1: reg 0x10: [mem 0x14000010000-0x1400001ffff 64bit pref]
[ 38.927129][ T1] pci 0000:80:0f.1: reg 0x18: [mem 0x14000000000-0x1400000ffff 64bit pref]
[ 38.935693][ T1] acpiphp: Slot [1] registered
[ 38.940380][ T1] acpiphp: Slot [1-1] registered
[ 38.945240][ T1] acpiphp: Slot [1-2] registered
[ 38.950096][ T1] acpiphp: Slot [1-3] registered
[ 38.954941][ T1] pci 0000:91:00.0: [8086:0a54] type 00 class 0x010802
[ 38.961645][ T1] pci 0000:91:00.0: reg 0x10: [mem 0x60000000-0x60003fff 64bit]
[ 38.969148][ T1] pci 0000:91:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pr
<no further log messages>
The machine does boot with the 5.14.14 kernel, though. Upgrading to 5.15.10 did not work; it gets stuck at the same point during boot.
Downgraded both machines to the default Leap 15.3 kernel, so they are working again.
- Status changed from Workable to Blocked
- Assignee set to okurz
- Status changed from Blocked to New
- Assignee deleted (okurz)
All current subtasks are resolved. We need to brainstorm together again what to do next.
Yes. The outcome of #104304 is that os-autoinst's fullstack test is not sufficient to find any difference between arm-1/2/3 and 4/5.
The last time we spoke about this we had the idea to involve ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered help but needs more details about the machines. So I suggest gathering those details, e.g. log in, run dmesg and dmidecode, provide the output in the ticket, and ask ggardet_arm again. Maybe something about hugepages, CPU flags, or boot kernel parameters to work around I/O quirks, anything like that. We created a new, more specific suggestion in a subtask.
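Gathering those details could look like the following sketch (the command list and output filename are assumptions; dmesg and dmidecode typically need root):

```shell
# Collect basic hardware/kernel details into one file per worker.
# Failing commands (missing binary, no root) leave an error note in the file
# instead of aborting, because stderr is captured too.
for c in "uname -a" "cat /proc/cmdline" "lscpu" "dmesg" "dmidecode"; do
  echo "== $c =="
  $c 2>&1
done > arm-worker-details.txt
```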
- Status changed from New to Blocked
- Parent task set to #109743
- Description updated (diff)
- Status changed from Blocked to New
- Assignee deleted (mkittler)
- Target version changed from Ready to future
- Parent task changed from #109743 to #121732