coordination #101048

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Added by okurz 7 months ago. Updated 2 days ago.

Status:
Blocked
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-10-15
Due date:
2022-06-10
% Done:

60%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Observation

According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%
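For reference, the fail ratio shown in the dashboard can be reproduced from raw job results roughly like this. A minimal sketch, assuming job results are available as a list of final-state strings and that the dashboard counts "failed" and "incomplete" jobs as failures; the function name and data shape are hypothetical, not the actual Grafana query:

```python
def fail_ratio(results):
    """Return the fraction of jobs that ended as 'failed' or 'incomplete'.

    `results` is a list of final job states, e.g. from the openQA jobs API.
    """
    bad = sum(1 for r in results if r in ("failed", "incomplete"))
    return bad / len(results) if results else 0.0

# Made-up numbers in the observed ballpark (35% vs. 16%):
arm4_jobs = ["failed"] * 35 + ["passed"] * 65
arm1_jobs = ["failed"] * 16 + ["passed"] * 84
assert fail_ratio(arm4_jobs) == 0.35
assert fail_ratio(arm1_jobs) == 0.16
```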

Acceptance criteria

  • AC1: openqaworker-arm-4/5 have a fail ratio less than or equal to that of arm-1/2/3

Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5

  • The CPU of arm-4/5 (the specific model and revision, a Cavium ThunderX2) is known to behave badly for our use case; this is the difference from the older arm workers, which have the previous revision of that CPU model installed.
  • Disabling CPU control and CPU frequency scaling in the firmware environment didn't make a difference.
    • Before that we've already tried to reduce the number of worker slots a lot and it didn't help either.
    • There are still a few ideas to consider (see #109232#note-5).
    • There are also more variables in the firmware environment (see #109232#note-20) we can play with.
  • Next time we should buy different hardware (see private comment #109232#note-11).
  • See the full ticket #109232 for more context about these findings.

Suggestions

  • Confirm if typing issues cause the failures (look for timeouts, observe additional or missing characters in typed commands)
  • Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
  • Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the openSUSE:Leap:15.3 one
  • Consider downgrading the kernel to what's used in 15.2 -> the same upstream version is already running on most workers
  • Bring back arm 4 and 5 after verifying stability
  • Run typing.pm from os-autoinst as test in production -> #101262
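The first suggestion, confirming whether typing issues cause the failures, amounts to comparing the string a test typed with what actually arrived (e.g. on the serial console) and classifying dropped vs. duplicated characters. A hypothetical sketch of that comparison using Python's difflib, not the actual typing.pm implementation:

```python
import difflib

def typing_defects(expected, observed):
    """Classify character-level differences between the command a test
    typed and what actually showed up on the system under test."""
    defects = []
    matcher = difflib.SequenceMatcher(None, expected, observed)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":       # characters that were typed but never arrived
            defects.append(("missing", expected[i1:i2]))
        elif op == "insert":     # characters that arrived without being typed
            defects.append(("extra", observed[j1:j2]))
        elif op == "replace":    # characters that arrived garbled
            defects.append(("garbled", expected[i1:i2], observed[j1:j2]))
    return defects

# A dropped 'l' and a doubled 'p' in a typed command:
print(typing_defects("ls -la /tmp", "ls -a /tmpp"))
```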
Attachment: sysctl_diff.html (39.3 KB), arm4 left, arm3 right (nicksinger, 2021-10-18 11:36)

Subtasks

action #101262: Document running os-autoinst full-stack.t on OSD workers size:M (Resolved, okurz)

action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M (Resolved, mkittler)

openQA Infrastructure - action #101271: Try Kernel:stable on arm4+arm5 and compare failure rate size:M (Resolved, kraih)

openQA Infrastructure - action #104304: Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M (Resolved, mkittler)

action #109232: Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M (Resolved, mkittler)

openQA Infrastructure - action #109494: Restore network connection of arm-4/5 size:M (Resolved, nicksinger)

openQA Infrastructure - action #110539: Ask OBS team if they would like to swap ARM workers with us (New)

action #110542: Try to mitigate "VNC typing issues" with disabled key repeat (New)

openQA Infrastructure - action #110545: Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M (Blocked, okurz)

openQA Infrastructure - action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 (New, okurz)


Related issues

Related to openQA Project - action #101030: Typing problems on aarch64 (Resolved, 2021-10-15)

History

#1 Updated by okurz 7 months ago

#2 Updated by nicksinger 7 months ago

First investigation shows that we run Leap 15.2 on the "old" workers and 15.3 on the new ones. The kernel versions are quite different between workers:
arm-1: 5.8.3-1.gbad027a-default
arm-2: 5.7.12-1.g9c98feb-default
arm-3: 5.3.18-lp152.95-default
arm-4: 5.3.18-59.27-default
arm-5: 5.3.18-59.27-default

Given that arm-3 has at least a similar kernel version, I'd exclude the kernel as a cause for now.
The kernel cmdline is managed by salt and is therefore the same on all 5 machines.
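The conclusion above (arm-3 and arm-4/5 share the same upstream kernel despite different packaging suffixes) can be checked mechanically by extracting the upstream version from the `uname -r` strings. A small sketch; the version strings are taken from the list above:

```python
import re

def kernel_tuple(release):
    """Extract the upstream kernel version from a `uname -r` string,
    e.g. '5.3.18-lp152.95-default' -> (5, 3, 18)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    return tuple(int(x) for x in m.groups())

workers = {
    "arm-1": "5.8.3-1.gbad027a-default",
    "arm-3": "5.3.18-lp152.95-default",
    "arm-4": "5.3.18-59.27-default",
}
# arm-3 and arm-4 run the same upstream version despite different packaging:
assert kernel_tuple(workers["arm-3"]) == kernel_tuple(workers["arm-4"]) == (5, 3, 18)
```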

I also tried to diff the sysctls currently set on the systems. Due to the different kernels this is a quite tedious task, and I didn't see much that could make a difference here. Attaching the diff as HTML.
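The tedium of diffing sysctls across different kernels mostly comes from keys that exist on one kernel only. A hypothetical sketch (not how the attached HTML diff was produced) that compares only keys present on both hosts, with made-up example values:

```python
def sysctl_diff(a, b):
    """Given two {key: value} dicts parsed from `sysctl -a` on two hosts,
    return only the keys present on BOTH hosts whose values differ.
    Keys that exist on only one kernel version are skipped to cut noise."""
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

# Made-up sample values for illustration only:
arm4 = {"kernel.pid_max": "32768", "vm.swappiness": "60", "new.only.key": "1"}
arm3 = {"kernel.pid_max": "4194304", "vm.swappiness": "60"}
assert sysctl_diff(arm4, arm3) == {"kernel.pid_max": ("32768", "4194304")}
```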

#3 Updated by nicksinger 7 months ago

Despite being far less utilized, arm-4 sees load spikes up to 75 and hovers around 25 quite constantly according to: https://monitor.qa.suse.de/d/WDopenqaworker-arm-4/worker-dashboard-openqaworker-arm-4?viewPanel=54694&orgId=1&from=1633952613651&to=1634557413651 - this could hint at I/O performance issues.
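Whether a load of 75 is alarming depends on the number of logical CPUs; a ThunderX2 box has many. A trivial normalization sketch (the core count below is a made-up example; check `nproc` on the actual machine):

```python
def load_per_core(loadavg, ncpus):
    """Normalize a load average by the logical CPU count; sustained values
    well above 1.0 suggest run-queue congestion, often caused by I/O wait."""
    return loadavg / ncpus

# Hypothetical: 64 logical CPUs assumed for illustration
assert load_per_core(75, 64) > 1.0   # spike: oversubscribed
assert load_per_core(25, 64) < 0.5   # sustained but not saturated
```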

#5 Updated by nicksinger 7 months ago

Network I/O seems fine. Higher packet drop rates can be observed on arm-4 & arm-5 (exactly the same pattern, hinting at the switch), but IMHO this shouldn't cause such a performance hit.
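To back the "shouldn't cause such a performance hit" judgment with a number, the drop rate can be computed from two samples of the interface counters (e.g. `/sys/class/net/<iface>/statistics/rx_dropped` and `rx_packets`). A pure-function sketch with made-up counter deltas:

```python
def drop_ratio(dropped_delta, packets_delta):
    """Fraction of received packets dropped between two counter samples,
    taken e.g. from /sys/class/net/<iface>/statistics/ on two occasions."""
    return dropped_delta / packets_delta if packets_delta else 0.0

# Made-up sample: 120 drops out of 2 million packets in the interval,
# i.e. well below 0.1% and unlikely to explain a 2x fail ratio.
assert drop_ratio(120, 2_000_000) < 0.001
```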

#6 Updated by okurz 7 months ago

  • Priority changed from High to Urgent

#7 Updated by cdywan 7 months ago

  • Description updated (diff)

#8 Updated by cdywan 7 months ago

  • Tracker changed from action to coordination
  • Subject changed from Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
  • Description updated (diff)

#9 Updated by cdywan 7 months ago

  • Copied to action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M added

#10 Updated by okurz 7 months ago

  • Description updated (diff)

#11 Updated by okurz 7 months ago

  • Description updated (diff)

#12 Updated by okurz 7 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

blocked by subtasks

#13 Updated by okurz 6 months ago

  • Description updated (diff)

#14 Updated by okurz 6 months ago

  • Description updated (diff)

#15 Updated by okurz 6 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

I updated the epic with the results from #101265 and #101271 . We can now continue defining more hypotheses to follow-up with.

#16 Updated by okurz 6 months ago

  • Subject changed from [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

#17 Updated by kraih 6 months ago

There are still a lot of failed jobs from the #101271 stress test that should be searched for patterns. Maybe that will give some hints about where to look in follow-up investigations.

#18 Updated by okurz 6 months ago

  • Status changed from New to Workable

#20 Updated by mkittler 6 months ago

The workers arm-4/5 went offline on 05.12.2021. IPMI still responds, so I invoked a power cycle. However, both workers didn't boot successfully; they both got stuck in early boot:

Loading Linux 5.15.5-lp153.2.g83fc974-default ...
Loading initial ramdisk ...
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
EFI stub: ERROR: FIRMWARE BUG: Image BSS overlaps adjacent EFI memory region
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
INFO:    Node: 0 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
INFO:    Node: 1 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
[    0.000000][    T0] Booting Linux on physical CPU 0x0000000000 [0x431f0af2]
[    0.000000][    T0] Linux version 5.15.5-lp153.2.g83fc974-default (geeko@buildhost) (gcc (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9], GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26) #1 SMP Thu Nov 25 09:36:40 UTC 2021 (83fc974)
[    0.000000][    T0] efi: EFI v2.70 by American Megatrends
[    0.000000][    T0] efi: ESRT=0xf9515018 SMBIOS=0xfe390000 SMBIOS 3.0=0xfe380000 ACPI 2.0=0xfd8d0000 MOKvar=0xf7bd7000 MEMRESERVE=0xf4801798 
[    0.000000][    T0] esrt: Reserving ESRT space from 0x00000000f9515018 to 0x00000000f9515050.
[    0.000000][    T0] ACPI: Early table checksum verification disabled
…
[   39.152724][    T1] pci_bus 0000:80: resource 4 [mem 0x60000000-0x7fffffff window]
[   39.160282][    T1] pci_bus 0000:80: resource 5 [mem 0x14000000000-0x17fffffffff window]
[   39.168366][    T1] pci_bus 0000:91: resource 1 [mem 0x60000000-0x600fffff]
[   39.214998][    T1] iommu: Default domain type: Passthrough 
[   39.220723][    T1] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[   39.229763][    T1] pci 0000:0d:00.0: vgaarb: bridge control possible
[   39.236201][    T1] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[   39.245755][    T1] vgaarb: loaded
[   39.249503][    T1] SCSI subsystem initialized
[   39.254128][    T1] pps_core: LinuxPPS API ver. 1 registered
[   39.259785][    T1] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[   39.269652][    T1] PTP clock support registered
[   39.274282][    T1] EDAC MC: Ver: 3.0.0
[   39.278314][    T1] Registered efivars operations
[   39.287410][    T1] NetLabel: Initializing
[   39.291497][    T1] NetLabel:  domain hash size = 128
[   39.296541][    T1] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[   39.302898][    T1] NetLabel:  unlabeled traffic allowed by default
<no further log messages>

I removed both workers from salt and paused the host-up alerts.

#21 Updated by kraih 5 months ago

I also gave power cycling arm-4 a try and for me it ended at a slightly different point:

...
[   27.159822][    T1] pci_bus 0000:0e: resource 1 [mem 0x43100000-0x432fffff]
[   27.196956][    T1] ARMH0011:00: ttyAMA0 at MMIO 0x402020000 root bus resource [mem 0x60000000-0x7fffffff window]
[   38.675918][    T1] pci_bus 0000:80: root bus resource [mem 0x14000000000-0x17fffffffff window]
[   38.684606][    T1] pci_bus 0000:80: root bus resource [bus 80-ff]
[   38.690813][    T1] pci 0000:80:00.0: [177d:af00] type 00 class 0x060000
[   38.697639][    T1] pci 0000:80:01.0: [177d:af84] type 01 class 0x060400
[   38.704369][    T1] pci 0000:80:01.0: PME# supported from D0 D3hot D3cold
[   38.711273][    T1] pci 0000:80:02.0: [177d:af84] type 01 class 0x060400
[   38.718000][    T1] pci 0000:80:02.0: PME# supported from D0 D3hot D3cold
[   38.724911][    T1] pci 0000:80:03.0: [177d:af84] type 01 class 0x060400
[   38.731640][    T1] pci 0000:80:03.0: PME# supported from D0 D3hot D3cold
[   38.738539][    T1] pci 0000:80:04.0: [177d:af84] type 01 class 0x060400
[   38.745267][    T1] pci 0000:80:04.0: PME# supported from D0 D3hot D3cold
[   38.752161][    T1] pci 0000:80:05.0: [177d:af84] type 01 class 0x060400
[   38.758892][    T1] pci 0000:80:05.0: PME# supported from D0 D3hot D3cold
[   38.888346][    T1] pci 0000:80:0f.0: [14e4:9026] type 00 class 0x0c0330
[   38.895048][    T1] pci 0000:80:0f.0: reg 0x10: [mem 0x14000030000-0x1400003ffff 64bit pref]
[   38.903479][    T1] pci 0000:80:0f.0: reg 0x18: [mem 0x14000020000-0x1400002ffff 64bit pref]
[   38.911999][    T1] pci 0000:80:0f.1: [14e4:9026] type 00 class 0x0c0330
[   38.918695][    T1] pci 0000:80:0f.1: reg 0x10: [mem 0x14000010000-0x1400001ffff 64bit pref]
[   38.927129][    T1] pci 0000:80:0f.1: reg 0x18: [mem 0x14000000000-0x1400000ffff 64bit pref]
[   38.935693][    T1] acpiphp: Slot [1] registered
[   38.940380][    T1] acpiphp: Slot [1-1] registered
[   38.945240][    T1] acpiphp: Slot [1-2] registered
[   38.950096][    T1] acpiphp: Slot [1-3] registered
[   38.954941][    T1] pci 0000:91:00.0: [8086:0a54] type 00 class 0x010802
[   38.961645][    T1] pci 0000:91:00.0: reg 0x10: [mem 0x60000000-0x60003fff 64bit]
[   38.969148][    T1] pci 0000:91:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pr
<no further log messages>

The machine does boot with the 5.14.14 kernel though. Upgrading to 5.15.10 did not work either; it got stuck at the same point during boot.

#22 Updated by kraih 5 months ago

Downgraded both machines to the default Leap 15.3 kernel, so they are working again.

#23 Updated by okurz 5 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

#24 Updated by okurz 4 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

All current subtasks are resolved. We need to brainstorm together again about what to do next.

#25 Updated by mkittler 4 months ago

Yes. The outcome of #104304 is that os-autoinst's fullstack test is not sufficient to find any difference between arm-1/2/3 and 4/5.

#26 Updated by okurz about 2 months ago

The last time we spoke about this we came up with the idea of involving ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered to help but needs more details about the machines. So I suggest gathering details, e.g. logging in, calling dmesg and dmidecode, providing those details in the ticket, and asking ggardet_arm again. Maybe something about hugepages, CPU flags, or boot kernel parameters to work around I/O quirks, anything like that. We created a new specific suggestion in a subtask.

#27 Updated by mkittler about 2 months ago

  • Assignee set to mkittler

#28 Updated by okurz about 2 months ago

  • Status changed from New to Blocked

#29 Updated by okurz about 2 months ago

  • Parent task set to #109743

#30 Updated by mkittler 25 days ago

  • Description updated (diff)
