action #115547
Updated by livdywan over 2 years ago
Today I noticed that openqaworker20 was MIA. It didn't respond to ping and the BMC revealed that it got stuck *really* early during boot, on the BIOS splash screen!
The phase it got stuck in was "DXE--SB Initialization". A reset helped, but after loading the kernel and initrd the system crashed again and got stuck on the BIOS splash, this time during "PCI resource allocation". I did a power cycle and turned off "Quiet boot" in the BIOS settings for good measure and added `verbose debug` to the kernel cmdline.
Unfortunately it crashes in a rather bad place:
```
[ 2.794996][ T1] smpboot: CPU0: AMD EPYC 7543P 32-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
[ 2.798620][ T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 2.802523][ T1] ... version: 0
[ 2.806522][ T1] ... bit width: 48
[ 2.810522][ T1] ... generic registers: 6
[ 2.814522][ T1] ... value mask: 0000ffffffffffff
[ 2.818522][ T1] ... max period: 00007fffffffffff
[ 2.822522][ T1] ... fixed-purpose events: 0
[ 2.826522][ T1] ... event mask: 000000000000003f
[ 2.830576][ T1] rcu: Hierarchical SRCU implementation.
[ 2.834789][ T7] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 2.838797][ T1] smp: Bringing up secondary CPUs ...
[ 2.842584][ T1] x86: Booting SMP configuration:
(stuck here for ~10s, then reset)
```
Booting the older `5.14.21-150400.22-default` kernel doesn't work either.
# Suggestions
- Disabling some of the CPUs seems to work around the issue, so maybe this is a hardware fault in one or more of the CPUs
- Run a memory test
- Consider updating the BIOS/firmware
- Visit the server room, or find someone with access who can examine the machine
- Contact the vendor, likely Delta Computers, to take the machine back
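
To narrow down the first suggestion, one option is to limit how many CPUs the kernel brings up and bisect on that number. The following is only a rough sketch, not what was actually done on openqaworker20: it assumes the limit is applied with the kernel's `maxcpus=` boot parameter (rather than in the BIOS), that booting with a single CPU works, and that "boots or not" is monotonic in the CPU count. The helper names `with_cpu_limit` and `bisect_cpu_limit` and the `boots_ok` callback are hypothetical; in practice `boots_ok` would be a person rebooting the machine and reporting the result.

```python
from typing import Callable


def with_cpu_limit(cmdline: str, max_cpus: int) -> str:
    """Return the kernel cmdline with maxcpus=<max_cpus> set.

    Any maxcpus= entry already present is dropped first so the
    parameter never appears twice.
    """
    if max_cpus < 1:
        raise ValueError("max_cpus must be at least 1")
    kept = [p for p in cmdline.split() if not p.startswith("maxcpus=")]
    return " ".join(kept + [f"maxcpus={max_cpus}"])


def bisect_cpu_limit(total_cpus: int, boots_ok: Callable[[int], bool]) -> int:
    """Find the largest CPU count that still boots.

    Assumption (not verified on the real machine): a boot with 1 CPU
    succeeds, and if N CPUs fail then every count above N fails too.
    """
    if boots_ok(total_cpus):
        return total_cpus
    lo, hi = 1, total_cpus  # lo is assumed good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots_ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For example, `with_cpu_limit("verbose debug", 8)` yields a cmdline to try for one boot attempt. If the failure turns out not to depend on the CPU count in a monotonic way (for instance a single bad core in the middle), this bisection would be misleading and the memory test or vendor route is probably the better next step.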