coordination #101048
open
[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Added by okurz over 3 years ago.
Updated over 1 year ago.
Category:
Regressions/Crashes
Estimated time:
(Total: 0.00 h)
Description
Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%
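For illustration, the fail ratio shown in that dashboard is just failed/(total finished) per worker. A minimal sketch of that arithmetic, using a made-up "worker result" input format (not the actual openQA data model):

```shell
# Hypothetical input: one line per finished job, "worker result".
# Computes the per-worker fail ratio the Grafana table aggregates.
awk '{ n[$1]++; if ($2 == "failed") f[$1]++ }
     END { for (w in n) printf "%s: %.0f%% failed\n", w, 100 * f[w] / n[w] }' <<'EOF'
arm-1 passed
arm-1 failed
arm-4 failed
arm-4 failed
arm-4 passed
EOF
```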
Acceptance criteria
- AC1: openqaworker-arm-4/5 have a fail-ratio less or equal to arm-1/2/3
Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5
- The CPU of arm-4/5 (specifically the Cavium ThunderX2) is known to behave badly for our use case; that's the difference to the older arm workers, which have the previous version of that CPU model installed.
- Disabling CPU control and CPU frequency scaling in the firmware environment didn't make a difference.
- Before that we had already reduced the number of worker slots a lot, which didn't help either.
- There are still a few ideas to consider (see #109232#note-5).
- There are also more variables in the firmware environment (see #109232#note-20) we can play with.
- Next time we should buy different hardware (see private comment #109232#note-11).
- See the full ticket #109232 for more context about these findings.
Suggestions
- Confirm whether typing issues cause the failures (look for timeouts; watch for additional or missing characters in typed commands)
- Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
- Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the openSUSE:Leap:15.3 one
- Consider downgrading the kernel to the one used in 15.2 -> the same upstream version is already running on most
- Bring back arm 4 and 5 after verifying stability
- Run typing.pm from os-autoinst as test in production -> #101262
First investigation shows that we run Leap 15.2 on the "old" workers and Leap 15.3 on the new ones. The kernel versions differ quite a bit between workers:
arm-1: 5.8.3-1.gbad027a-default
arm-2: 5.7.12-1.g9c98feb-default
arm-3: 5.3.18-lp152.95-default
arm-4: 5.3.18-59.27-default
arm-5: 5.3.18-59.27-default
Given that arm-3 has at least a similar kernel version, I'd exclude the kernel for now.
Kernel cmdline is managed by salt and therefore the same on all 5 machines.
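A quick spot check that kernel version and cmdline really match can look like this (the salt target pattern is an assumption, not the actual production target):

```shell
# Per host: print the running kernel and its command line.
uname -r
cat /proc/cmdline
# Across all five workers at once via salt (hypothetical target pattern):
# salt 'openqaworker-arm-*' cmd.run 'uname -r; cat /proc/cmdline'
```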
I also tried diffing the sysctls currently set on the systems. Due to the different kernels this is quite a tedious task, and I didn't see much that could make a difference here. Attaching the diff as HTML.
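A sketch of how such a comparison can be produced (filenames are hypothetical). This dumps all readable sysctls into one sorted file per host, equivalent to `sysctl -a | sort` but without depending on the sysctl binary:

```shell
# Dump every readable key under /proc/sys as "key = value", sorted.
for f in $(find /proc/sys -type f 2>/dev/null); do
  printf '%s = %s\n' "${f#/proc/sys/}" "$(tr '\n' ' ' 2>/dev/null < "$f")"
done | sort > "sysctl-$(uname -n).txt"
# After collecting one dump per worker, diff any two of them, e.g.:
#   diff sysctl-openqaworker-arm-3.txt sysctl-openqaworker-arm-4.txt
```

Keys that legitimately differ per host (hostname, random seeds) still need to be filtered out by hand, which is part of why the exercise is tedious.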
Network IO seems fine. Higher packet drops can be observed on arm-4 and arm-5 (exactly the same pattern, so hinting at the switch), but IMHO this shouldn't cause such a performance hit.
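The drop counters can be read straight from /proc/net/dev on each worker; a minimal sketch (column 5 of each interface line is the receive "drop" counter):

```shell
# Print per-interface RX drop counters from /proc/net/dev.
# sed splits the "iface:" prefix so awk field numbering is stable.
sed 's/:/ /' /proc/net/dev | awk 'NR > 2 { printf "%s rx_drop=%s\n", $1, $5 }'
```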
- Priority changed from High to Urgent
- Description updated (diff)
- Tracker changed from action to coordination
- Subject changed from Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
- Description updated (diff)
- Copied to action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M added
- Description updated (diff)
- Description updated (diff)
- Status changed from New to Blocked
- Assignee set to okurz
- Description updated (diff)
- Description updated (diff)
- Status changed from Blocked to New
- Assignee deleted (okurz)
I updated the epic with the results from #101265 and #101271. We can now continue defining more hypotheses to follow up on.
- Subject changed from [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
There are still a lot of failed jobs from the #101271 stress test that should be searched for patterns. Maybe that will give some hints on where to look in follow-up investigations.
- Status changed from New to Workable
The workers arm-4/5 went offline on 2021-12-05. IPMI still responds, so I invoked a power cycle. However, both workers didn't boot successfully; they both got stuck in early boot:
Loading Linux 5.15.5-lp153.2.g83fc974-default ...
Loading initial ramdisk ...
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
EFI stub: ERROR: FIRMWARE BUG: Image BSS overlaps adjacent EFI memory region
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
INFO: Node: 0 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
INFO: Node: 1 :: REP: 0x0, REP-FAIL: 0x0, MBIST: 0x0, MBIST-FAIL: 0x803c3c
[ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x431f0af2]
[ 0.000000][ T0] Linux version 5.15.5-lp153.2.g83fc974-default (geeko@buildhost) (gcc (SUSE Linux) 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9], GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26) #1 SMP Thu Nov 25 09:36:40 UTC 2021 (83fc974)
[ 0.000000][ T0] efi: EFI v2.70 by American Megatrends
[ 0.000000][ T0] efi: ESRT=0xf9515018 SMBIOS=0xfe390000 SMBIOS 3.0=0xfe380000 ACPI 2.0=0xfd8d0000 MOKvar=0xf7bd7000 MEMRESERVE=0xf4801798
[ 0.000000][ T0] esrt: Reserving ESRT space from 0x00000000f9515018 to 0x00000000f9515050.
[ 0.000000][ T0] ACPI: Early table checksum verification disabled
…
[ 39.152724][ T1] pci_bus 0000:80: resource 4 [mem 0x60000000-0x7fffffff window]
[ 39.160282][ T1] pci_bus 0000:80: resource 5 [mem 0x14000000000-0x17fffffffff window]
[ 39.168366][ T1] pci_bus 0000:91: resource 1 [mem 0x60000000-0x600fffff]
[ 39.214998][ T1] iommu: Default domain type: Passthrough
[ 39.220723][ T1] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 39.229763][ T1] pci 0000:0d:00.0: vgaarb: bridge control possible
[ 39.236201][ T1] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[ 39.245755][ T1] vgaarb: loaded
[ 39.249503][ T1] SCSI subsystem initialized
[ 39.254128][ T1] pps_core: LinuxPPS API ver. 1 registered
[ 39.259785][ T1] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[ 39.269652][ T1] PTP clock support registered
[ 39.274282][ T1] EDAC MC: Ver: 3.0.0
[ 39.278314][ T1] Registered efivars operations
[ 39.287410][ T1] NetLabel: Initializing
[ 39.291497][ T1] NetLabel: domain hash size = 128
[ 39.296541][ T1] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
[ 39.302898][ T1] NetLabel: unlabeled traffic allowed by default
<no further log messages>
I removed both workers from salt and paused the host-up alerts.
I also gave power cycling arm-4 a try, and for me it ended at a slightly different point:
...
[ 27.159822][ T1] pci_bus 0000:0e: resource 1 [mem 0x43100000-0x432fffff]
[ 27.196956][ T1] ARMH0011:00: ttyAMA0 at MMIO 0x402020000 root bus resource [mem 0x60000000-0x7fffffff window]
[ 38.675918][ T1] pci_bus 0000:80: root bus resource [mem 0x14000000000-0x17fffffffff window]
[ 38.684606][ T1] pci_bus 0000:80: root bus resource [bus 80-ff]
[ 38.690813][ T1] pci 0000:80:00.0: [177d:af00] type 00 class 0x060000
[ 38.697639][ T1] pci 0000:80:01.0: [177d:af84] type 01 class 0x060400
[ 38.704369][ T1] pci 0000:80:01.0: PME# supported from D0 D3hot D3cold
[ 38.711273][ T1] pci 0000:80:02.0: [177d:af84] type 01 class 0x060400
[ 38.718000][ T1] pci 0000:80:02.0: PME# supported from D0 D3hot D3cold
[ 38.724911][ T1] pci 0000:80:03.0: [177d:af84] type 01 class 0x060400
[ 38.731640][ T1] pci 0000:80:03.0: PME# supported from D0 D3hot D3cold
[ 38.738539][ T1] pci 0000:80:04.0: [177d:af84] type 01 class 0x060400
[ 38.745267][ T1] pci 0000:80:04.0: PME# supported from D0 D3hot D3cold
[ 38.752161][ T1] pci 0000:80:05.0: [177d:af84] type 01 class 0x060400
[ 38.758892][ T1] pcrom D0 D3hot D3cold
[ 38.888346][ T1] pci 0000:80:0f.0: [14e4:9026] type 00 class 0x0c0330
[ 38.895048][ T1] pci 0000:80:0f.0: reg 0x10: [mem 0x14000030000-0x1400003ffff 64bit pref]
[ 38.903479][ T1] pci 0000:80:0f.0: reg 0x18: [mem 0x14000020000-0x1400002ffff 64bit pref]
[ 38.911999][ T1] pci 0000:80:0f.1: [14e4:9026] type 00 class 0x0c0330
[ 38.918695][ T1] pci 0000:80:0f.1: reg 0x10: [mem 0x14000010000-0x1400001ffff 64bit pref]
[ 38.927129][ T1] pci 0000:80:0f.1: reg 0x18: [mem 0x14000000000-0x1400000ffff 64bit pref]
[ 38.935693][ T1] acpiphp: Slot [1] registered
[ 38.940380][ T1] acpiphp: Slot [1-1] registered
[ 38.945240][ T1] acpiphp: Slot [1-2] registered
[ 38.950096][ T1] acpiphp: Slot [1-3] registered
[ 38.954941][ T1] pci 0000:91:00.0: [8086:0a54] type 00 class 0x010802
[ 38.961645][ T1] pci 0000:91:00.0: reg 0x10: [mem 0x60000000-0x60003fff 64bit]
[ 38.969148][ T1] pci 0000:91:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pr
<no further log messages>
The machine does boot with the 5.14.14 kernel, though. Upgrading to 5.15.10 did not work; it gets stuck at the same point during boot.
Downgraded both machines to the default Leap 15.3 kernel, so they are working again.
- Status changed from Workable to Blocked
- Assignee set to okurz
- Status changed from Blocked to New
- Assignee deleted (okurz)
All current subtasks are resolved. We need to brainstorm together again what to do next.
Yes. The outcome of #104304 is that os-autoinst's fullstack test is not sufficient to find any difference between arm-1/2/3 and 4/5.
The last time we spoke about this we had the idea to involve ARM experts. okurz asked ggardet_arm in https://app.element.io/#/room/#openqa:opensuse.org (or maybe it was #opensuse-factory) and he offered help but needs more details about the machines. So I suggest gathering those details, e.g. log in, run dmesg and dmidecode, provide the output in the ticket, and ask ggardet_arm again. Maybe something about hugepages, CPU flags, or boot kernel parameters to work around I/O quirks, anything like that. We created a new, more specific suggestion in a subtask.
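Gathering those details could look like the following sketch (the command list and output filename are assumptions; dmesg and dmidecode typically need root):

```shell
# Collect basic hardware/kernel details into one file per worker.
# Failing commands (missing binary, no root) leave an error note in the file
# instead of aborting, because stderr is captured too.
for c in "uname -a" "cat /proc/cmdline" "lscpu" "dmesg" "dmidecode"; do
  echo "== $c =="
  $c 2>&1
done > arm-worker-details.txt
```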
- Status changed from New to Blocked
- Parent task set to #109743
- Description updated (diff)
- Status changed from Blocked to New
- Assignee deleted (mkittler)
- Target version changed from Ready to future
- Parent task changed from #109743 to #121732