action #115547
Updated by livdywan about 2 years ago
## Motivation

Today I noticed that openqaworker20 was MIA. It didn't respond to ping, and the BMC revealed that it got stuck *really* early during boot, on the BIOS splash screen! The phase it got stuck in was "DXE--SB Initialization". A reset helped, but after loading the kernel and initrd the system crashed again and got stuck on the BIOS splash, this time during "PCI resource allocation". I did a power cycle, turned off "Quiet boot" in the BIOS settings for good measure, and added `verbose debug` to the kernel cmdline. Unfortunately it crashes in a rather bad place:

```
[ 2.794996][ T1] smpboot: CPU0: AMD EPYC 7543P 32-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
[ 2.798620][ T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 2.802523][ T1] ... version: 0
[ 2.806522][ T1] ... bit width: 48
[ 2.810522][ T1] ... generic registers: 6
[ 2.814522][ T1] ... value mask: 0000ffffffffffff
[ 2.818522][ T1] ... max period: 00007fffffffffff
[ 2.822522][ T1] ... fixed-purpose events: 0
[ 2.826522][ T1] ... event mask: 000000000000003f
[ 2.830576][ T1] rcu: Hierarchical SRCU implementation.
[ 2.834789][ T7] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 2.838797][ T1] smp: Bringing up secondary CPUs ...
[ 2.842584][ T1] x86: Booting SMP configuration:
(stuck here for ~10s, then reset)
```

Booting the older `5.14.21-150400.22-default` kernel doesn't work either.

## Acceptance criteria

* **AC1**: openqaworker20 is back in business
* **AC2**: openqaworker20 is set up as an o3 worker in the same way as openqaworker19 is

## Suggestions

- Disabling some of the CPUs seems to work around the issue, so maybe this is a hardware fault in one or more of the CPUs
- Run a memory test
- Consider updating the BIOS/firmware
- Visit the server room, or find someone with access who can examine the machine
- Contact the vendor, likely Delta Computers, to take the machine back
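
The first suggestion could be narrowed down by bisecting how many CPUs the kernel can bring up before it hangs. The ticket does not say how the CPUs were disabled, so the following is only a sketch under assumptions: it uses the kernel's documented `maxcpus=` boot parameter, and the function names and the `boots` callback are hypothetical (in practice `boots(n)` would be a person rebooting the machine with that cmdline and reporting the result). It also assumes the failure is monotonic, i.e. once a CPU count fails every larger count fails too, which may not hold if the fault is intermittent.

```python
from typing import Callable


def kernel_cmdline_for(n: int) -> str:
    """Build a kernel cmdline fragment limiting SMP bring-up to n CPUs.

    `verbose debug` are the options already used in the ticket;
    `maxcpus=` is an assumption about how one might limit the CPUs.
    """
    return f"verbose debug maxcpus={n}"


def largest_bootable_cpu_count(total_cpus: int, boots: Callable[[int], bool]) -> int:
    """Binary-search the largest n in [1, total_cpus] for which boots(n) is True.

    Assumes boots(1) is True and that failures are monotonic.
    """
    low, high = 1, total_cpus  # invariant: boots(low) is assumed True
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if boots(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Simulated machine: pretend bring-up hangs as soon as more than 20 CPUs
    # are started. This threshold is made up purely for the demonstration.
    def simulated_boot(n: int) -> bool:
        return n <= 20

    best = largest_bootable_cpu_count(64, simulated_boot)
    print(f"largest working count: {best}, cmdline: {kernel_cmdline_for(best)}")
```

If the search converges on a stable count, the first CPU beyond it would be the prime suspect to mention to the vendor. If results vary between runs, a memory test or firmware update is probably the more useful next step.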