action #115547

closed

openqaworker20 fails to boot, broken hardware size:M

Added by favogt over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-08-19
Due date:
% Done: 100%
Estimated time:

Description

Motivation

Today I noticed that openqaworker20 was MIA. It didn't respond to ping and the BMC revealed that it got stuck really early during boot, on the BIOS splash screen!
The phase it got stuck in was "DXE--SB Initialization". A reset helped, but after loading the kernel and initrd the system crashed again and got stuck on the BIOS splash, this time during "PCI resource allocation". I did a power cycle and turned off "Quiet boot" in the BIOS settings for good measure and added verbose debug to the kernel cmdline.

Unfortunately it crashes in a rather bad place:

[    2.794996][    T1] smpboot: CPU0: AMD EPYC 7543P 32-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
[    2.798620][    T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    2.802523][    T1] ... version:                0
[    2.806522][    T1] ... bit width:              48
[    2.810522][    T1] ... generic registers:      6
[    2.814522][    T1] ... value mask:             0000ffffffffffff
[    2.818522][    T1] ... max period:             00007fffffffffff
[    2.822522][    T1] ... fixed-purpose events:   0
[    2.826522][    T1] ... event mask:             000000000000003f
[    2.830576][    T1] rcu: Hierarchical SRCU implementation.
[    2.834789][    T7] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    2.838797][    T1] smp: Bringing up secondary CPUs ...
[    2.842584][    T1] x86: Booting SMP configuration:
(stuck here for ~10s, then reset)

Booting the older 5.14.21-150400.22-default kernel doesn't work either.
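
When comparing boot attempts across kernels, it can help to extract the last message printed before the hang. The following is a minimal sketch, assuming the log lines carry the `[seconds][caller]` prefix seen in the excerpt above; the names `parse_line` and `last_message` are hypothetical helpers and not part of any openQA tooling.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "[    2.842584][    T1] x86: Booting SMP configuration:".
# Assumption: the caller field is "T<pid>" or "C<cpu>", as in the excerpt.
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\[\s*(?P<caller>[TC]\d+)\]\s(?P<msg>.*)$"
)


class LogLine(NamedTuple):
    seconds: float
    caller: str
    message: str


def parse_line(line: str) -> Optional[LogLine]:
    """Parse one kernel log line; return None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LogLine(float(match["ts"]), match["caller"], match["msg"])


def last_message(log_text: str) -> Optional[LogLine]:
    """Return the last parseable kernel message in a captured console log."""
    parsed = [p for p in map(parse_line, log_text.splitlines()) if p is not None]
    return parsed[-1] if parsed else None
```

Feeding it the console capture of each attempt shows whether every crash stops at the same message (here `x86: Booting SMP configuration:`), which would point at secondary CPU bring-up rather than at a specific kernel version.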

Acceptance criteria

  • AC1: openqaworker20 is back in business
  • AC2: openqaworker20 is set up as an o3 worker in the same way as openqaworker19

Suggestions

  • Disabling some of the CPUs seems to work around the issue, so maybe this is a hardware fault in one or more of the CPUs
  • Run a memory test
  • Consider updating the BIOS/firmware
  • Visit the server room, or find someone with access who can examine the machine
  • Contact the vendor, likely Delta Computers, to take the machine back
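
To narrow down the first suggestion, one option is to bisect over the kernel's `maxcpus=` boot parameter, which limits how many CPUs are brought up during boot. The sketch below is only an illustration under strong assumptions: the failure is deterministic, and a boot that succeeds with N CPUs also succeeds with fewer. The callback `boots_with` is hypothetical; in practice it would mean editing the kernel cmdline, rebooting, and recording the result by hand.

```python
from typing import Callable, Optional


def first_failing_cpu_count(
    total_cpus: int, boots_with: Callable[[int], bool]
) -> Optional[int]:
    """Return the smallest maxcpus value that fails to boot.

    Returns None if the machine boots with all CPUs enabled.
    """
    if boots_with(total_cpus):
        return None
    low, high = 1, total_cpus  # invariant: `high` is known to fail
    while low < high:
        mid = (low + high) // 2
        if boots_with(mid):
            low = mid + 1  # everything up to `mid` boots fine
        else:
            high = mid  # failure occurs at `mid` or earlier
    return high
```

If the result is N, the CPU most recently added to the set (logical CPU N-1 under these assumptions) is the first suspect. This only localizes the symptom; it does not replace a memory test or a hardware inspection.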

Files

ow20-events.webm (390 KB), mkittler, 2022-08-25 13:20
20221011_153506-1.jpg (417 KB), mgriessmeier, 2022-10-11 13:38

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #111473: Get replacements for imagetester and openqaworker1 size:M (Resolved, mkittler, 2022-05-23)

Related to openQA Infrastructure - action #115418: Setup ow19+20 to be able to run MM tests size:M (Resolved, favogt, 2022-08-17)