action #115547

Updated by livdywan about 2 years ago

## Motivation 

 Today I noticed that openqaworker20 was MIA. It didn't respond to ping and the BMC revealed that it got stuck *really* early during boot, on the BIOS splash screen! 
 The phase it got stuck in was "DXE--SB Initialization". A reset helped, but after loading the kernel and initrd the system crashed again and got stuck on the BIOS splash, this time during "PCI resource allocation". I did a power cycle and turned off "Quiet boot" in the BIOS settings for good measure and added `verbose debug` to the kernel cmdline. 

 Unfortunately it crashes in a rather bad place: 

 ``` 
 [      2.794996][      T1] smpboot: CPU0: AMD EPYC 7543P 32-Core Processor (family: 0x19, model: 0x1, stepping: 0x1) 
 [      2.798620][      T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver. 
 [      2.802523][      T1] ... version:                  0 
 [      2.806522][      T1] ... bit width:                48 
 [      2.810522][      T1] ... generic registers:        6 
 [      2.814522][      T1] ... value mask:               0000ffffffffffff 
 [      2.818522][      T1] ... max period:               00007fffffffffff 
 [      2.822522][      T1] ... fixed-purpose events:     0 
 [      2.826522][      T1] ... event mask:               000000000000003f 
 [      2.830576][      T1] rcu: Hierarchical SRCU implementation. 
 [      2.834789][      T7] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter. 
 [      2.838797][      T1] smp: Bringing up secondary CPUs ... 
 [      2.842584][      T1] x86: Booting SMP configuration: 
 (stuck here for ~10s, then reset) 
 ``` 

 Booting the older `5.14.21-150400.22-default` kernel doesn't work either. 

 ## Acceptance criteria 
 * **AC1**: openqaworker20 is back in business 
 * **AC2**: openqaworker20 is set up as an o3 worker in the same way as openqaworker19 is. 

 ## Suggestions 
 - Disabling some of the CPUs seems to work around the issue, so this may be a hardware fault in one or more of the CPUs 
 - Run a memory test 
 - Consider updating the BIOS/firmware 
 - Visit the server room, or find someone with access who can examine the machine 
 - Contact the vendor, likely Delta Computers, to take the machine back 
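
 If the CPU workaround is worth narrowing down, one option is to bisect how many CPUs the kernel can bring up before it hangs. The sketch below is only an illustration and was not part of the original investigation. It assumes the CPU count is limited with the `maxcpus=` kernel parameter, and that the failure is monotonic, meaning that if N CPUs boot then any smaller number boots too. That may not hold if a single core in the middle is faulty. The names `largest_booting_cpu_count` and `ask_operator` are made up for this example.

```python
from typing import Callable


def largest_booting_cpu_count(total: int, boots: Callable[[int], bool]) -> int:
    """Binary-search the largest N in [1, total] for which boots(N) is true.

    Assumption: boots(1) succeeds and boots() is monotonic (once it fails
    for some N, it fails for every larger N as well).
    """
    lo, hi = 1, total  # lo is assumed to boot; hi is the largest candidate
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if boots(mid):
            lo = mid       # mid CPUs boot: the answer is mid or higher
        else:
            hi = mid - 1   # mid CPUs hang: the answer is below mid
    return lo


def ask_operator(n: int) -> bool:
    """Hypothetical manual check: the operator reboots and reports the result."""
    answer = input(f"Boot with maxcpus={n} on the kernel cmdline. Did it boot? [y/n] ")
    return answer.strip().lower().startswith("y")


if __name__ == "__main__":
    # 32 is the core count reported in the boot log above; adjust if SMT is on.
    print("Largest booting CPU count:", largest_booting_cpu_count(32, ask_operator))
```

 Each probe costs one reboot, so with 32 cores this needs about five boot attempts instead of up to 32.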
