action #115547

closed

openqaworker20 fails to boot, broken hardware size:M

Added by favogt over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-08-19
Due date:
% Done: 100%
Estimated time:

Description

Motivation

Today I noticed that openqaworker20 was MIA. It didn't respond to ping and the BMC revealed that it got stuck really early during boot, on the BIOS splash screen!
The phase it got stuck in was "DXE--SB Initialization". A reset helped, but after loading the kernel and initrd the system crashed again and got stuck on the BIOS splash, this time during "PCI resource allocation". I did a power cycle and turned off "Quiet boot" in the BIOS settings for good measure and added verbose debug to the kernel cmdline.

Unfortunately it crashes in a rather bad place:

[    2.794996][    T1] smpboot: CPU0: AMD EPYC 7543P 32-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
[    2.798620][    T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    2.802523][    T1] ... version:                0
[    2.806522][    T1] ... bit width:              48
[    2.810522][    T1] ... generic registers:      6
[    2.814522][    T1] ... value mask:             0000ffffffffffff
[    2.818522][    T1] ... max period:             00007fffffffffff
[    2.822522][    T1] ... fixed-purpose events:   0
[    2.826522][    T1] ... event mask:             000000000000003f
[    2.830576][    T1] rcu: Hierarchical SRCU implementation.
[    2.834789][    T7] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    2.838797][    T1] smp: Bringing up secondary CPUs ...
[    2.842584][    T1] x86: Booting SMP configuration:
(stuck here for ~10s, then reset)

Booting the older 5.14.21-150400.22-default kernel doesn't work either.

Acceptance criteria

  • AC1: openqaworker20 is back in business
  • AC2: openqaworker20 is set up as an o3 worker in the same way as openqaworker19

Suggestions

  • Disabling some of the CPUs seems to work around the issue, so maybe this is a hardware fault in one or more of the CPUs
  • Run a memory test
  • Consider updating the BIOS/firmware
  • Visit the server room, or find someone with access who can examine the machine
  • Contact the vendor, likely Delta Computers, to take the machine back

Files

ow20-events.webm (390 KB), mkittler, 2022-08-25 13:20
20221011_153506-1.jpg (417 KB), mgriessmeier, 2022-10-11 13:38

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #111473: Get replacements for imagetester and openqaworker1 size:M (Resolved, mkittler, 2022-05-23)

Related to openQA Infrastructure - action #115418: Setup ow19+20 to be able to run MM tests size:M (Resolved, favogt, 2022-08-17)

Actions #1

Updated by favogt over 1 year ago

  • Related to action #111473: Get replacements for imagetester and openqaworker1 size:M added
Actions #2

Updated by favogt over 1 year ago

I also noticed that the BMC Web UI is unable to report sensor readings and reports the fan states as "critical". On ow19, temperature and fan speed readings are available.

Actions #3

Updated by livdywan over 1 year ago

  • Related to action #115418: Setup ow19+20 to be able to run MM tests size:M added
Actions #4

Updated by favogt over 1 year ago

With maxcpus=0 1 the system manages to boot, and then the sensor readings are available as well and look fine.
It fails with maxcpus=1 already, so it appears to be a severe HW issue with either CPU or RAM.
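
For reference, a minimal sketch of how one might confirm on a booted system which maxcpus value was actually in effect. The helper name cmdline_maxcpus is made up for illustration; it only parses a kernel command line string such as the content of /proc/cmdline and does not change anything.

```shell
#!/bin/sh
# Hypothetical helper: print the value of maxcpus= from a kernel command line,
# or nothing if the parameter is absent. If it is given several times,
# the last occurrence is printed.
cmdline_maxcpus() {
    # $1: the command line string, e.g. "$(cat /proc/cmdline)"
    value=""
    for arg in $1; do
        case "$arg" in
            maxcpus=*) value="${arg#maxcpus=}" ;;
        esac
    done
    [ -n "$value" ] && printf '%s\n' "$value"
    return 0
}

# Possible usage on a running system (not executed here):
# cmdline_maxcpus "$(cat /proc/cmdline)"
# cat /sys/devices/system/cpu/online   # CPUs that were actually brought up
```

Comparing the parsed value with /sys/devices/system/cpu/online shows whether the restriction really applied to the running kernel.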

I also tried booting the 15.3 GA kernel+initrd with GRUB's http module, but that also crashes.

Actions #5

Updated by tinita over 1 year ago

  • Target version set to Ready
Actions #6

Updated by livdywan over 1 year ago

  • Subject changed from openqaworker20 fails to boot to openqaworker20 fails to boot due to sensor reading issues size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by favogt over 1 year ago

  • Subject changed from openqaworker20 fails to boot due to sensor reading issues size:M to openqaworker20 fails to boot, broken hardware size:M

The sensor issues are a red herring, the values probably only show up if the system is fully booted.

Actions #8

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #9

Updated by mkittler over 1 year ago

Only the fans FAN1, FAN3 and FANA have a "health status" in the BMC and it is critical. Maybe it is just a wrong sensor reading (as most sensor data is in fact missing) but it could also be the culprit. That disabling CPUs helped for a while would be consistent with an overheating system due to fan issues.

Note that I tried powering on the system again. However, the system got stuck even before reaching the bootloader. Then I tried it again and it could reach the bootloader and I could select the UEFI firmware menu from there. While in the firmware menu, the sensor readings on the BMC interface were unchanged. Unfortunately the UEFI menu itself doesn't show any interesting information (which would e.g. confirm that fans are broken). I only found a few entries in the event log which I've attached.

Actions #10

Updated by mkittler over 1 year ago

I could not reach the OS anymore so I could not check any sensors from there. I've shut the machine down again as I don't know what I could do next. I suppose as a next step someone with access to SRV1 needs to do a physical check.

Actions #11

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

I've just created an infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-96734

The only thing I can think of we could do from our side is a firmware update. However, if it is really a fan that might not be a good idea. In general, if it is really a fan we should likely refrain from doing too much.

Actions #12

Updated by favogt over 1 year ago

mkittler wrote:

Only the fans FAN1, FAN3 and FANA have a "health status" in the BMC and it is critical. Maybe it is just a wrong sensor reading (as most sensor data is in fact missing) but it could also be the culprit. That disabling CPUs helped for a while would be consistent with an overheating system due to fan issues.

Note that I tried powering on the system again. However, the system got stuck even before reaching the bootloader. Then I tried it again and it could reach the bootloader and I could select the UEFI firmware menu from there. While in the firmware menu, the sensor readings on the BMC interface were unchanged. Unfortunately the UEFI menu itself doesn't show any interesting information (which would e.g. confirm that fans are broken). I only found a few entries in the event log which I've attached.

When I booted the system with maxcpus=0, I got perfectly fine and healthy sensor readings while it was "up".

mkittler wrote:

I've just created an infra ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-96734

Can you add me to CC?

Actions #13

Updated by tinita over 1 year ago

favogt wrote:

Can you add me to CC?

I added you

Actions #14

Updated by mkittler over 1 year ago

We can rule out the broken fan theory. Gerhard checked and the fans are running. When accessing the system today, I found it again stuck at the boot screen (DXE--CPU Initialization … 63). Normally that phase of the boot screen passes very quickly, so it was clearly stuck again. Not sure why Gerhard said he could reliably reach GRUB. However, after a power reset I could reach GRUB again and also tried maxcpus=0. Then sensors also shows normal temperature values and all sensors show up green in the BMC, as @fvogt mentioned (including the ones under the Fan tab).

Note that @fvogt attempted a network installation and it crashes there as well. So a re-installation of the system doesn't make much sense.

Actions #15

Updated by mkittler over 1 year ago

Btw, that's the hardware we have:

openqaworker20:~ # inxi -F
System:    Host: openqaworker20 Kernel: 5.14.21-150400.24.18-default x86_64 bits: 64 Console: pty pts/0
           Distro: openSUSE Leap 15.4
Machine:   Type: Server System: Supermicro product: Super Server v: 0123456789 serial: 0123456789
           Mobo: Supermicro model: H12SSL-i v: 1.02 serial: ZM224S601286 UEFI: American Megatrends v: 2.3 date: 10/20/2021
CPU:       Info: Single Core model: AMD EPYC 7543P bits: 64 type: UP cache: L2: 512 KiB
           Speed: 1801 MHz min/max: 1500/2800 MHz Core speed (MHz): 1: 1801
Graphics:  Device-1: ASPEED Graphics Family driver: ast v: kernel
           Display: server: No display server data found. Headless machine? tty: 316x82
           Message: Advanced graphics data unavailable in console for root.
Audio:     Message: No device data found.
Network:   Device-1: Intel Ethernet X710 for 10GbE SFP+ driver: i40e
           IF: eth0 state: down mac: 40:a6:b7:3f:b7:74
           Device-2: Intel Ethernet X710 for 10GbE SFP+ driver: i40e
           IF: eth1 state: down mac: 40:a6:b7:3f:b7:75
           Device-3: Broadcom NetXtreme BCM5720 Gigabit Ethernet PCIe driver: tg3
           IF: eth2 state: up speed: 1000 Mbps duplex: full mac: 3c:ec:ef:93:aa:22
           Device-4: Broadcom NetXtreme BCM5720 Gigabit Ethernet PCIe driver: tg3
           IF: eth3 state: down mac: 3c:ec:ef:93:aa:23
           IF-ID-1: usb0 state: down mac: 92:1b:4c:c1:45:13
Bluetooth: Device-1: Insyde RNDIS/Ethernet Gadget type: USB driver: rndis_host
           Report: This feature requires one of these tools: hciconfig/bt-adapter
Drives:    Local Storage: total: 6.99 TiB used: 433.5 GiB (6.1%)
           ID-1: /dev/nvme0n1 vendor: Samsung model: MZQL23T8HCLS-00A07 size: 3.49 TiB
           ID-2: /dev/nvme1n1 vendor: Samsung model: MZQL23T8HCLS-00A07 size: 3.49 TiB
Partition: ID-1: / size: 100 GiB used: 5.97 GiB (6.0%) fs: btrfs dev: /dev/nvme0n1p2
           ID-2: /boot/efi size: 511 MiB used: 5 MiB (1.0%) fs: vfat dev: /dev/nvme0n1p1
           ID-3: /home size: 100 GiB used: 5.97 GiB (6.0%) fs: btrfs dev: /dev/nvme0n1p2
           ID-4: /opt size: 100 GiB used: 5.97 GiB (6.0%) fs: btrfs dev: /dev/nvme0n1p2
           ID-5: /tmp size: 100 GiB used: 5.97 GiB (6.0%) fs: btrfs dev: /dev/nvme0n1p2
           ID-6: /var size: 100 GiB used: 5.97 GiB (6.0%) fs: btrfs dev: /dev/nvme0n1p2
Swap:      Alert: No swap data was found.
Sensors:   Message: No sensor data found. Is lm-sensors configured?
Info:      Processes: 190 Uptime: 1h 37m Memory: 503.57 GiB used: 6.19 GiB (1.2%) Init: systemd runlevel: 3 Shell: Bash
           inxi: 3.3.07

And that's the chassis we have: SuperMicro 825BTQC-R1K23LPB (from order)

Actions #16

Updated by mkittler over 1 year ago

  • Assignee changed from mkittler to nicksinger

Looks like we have to contact Delta Computers.

Actions #17

Updated by nicksinger over 1 year ago

I contacted Delta. They offered that we send the server back so they can do further debugging in their lab. I just asked Gerhard whether he can clarify who can handle shipping the package back to Delta.

Actions #18

Updated by livdywan over 1 year ago

The server needs to be prepared for sending. It's in SRV1 so we need access for that.

Actions #19

Updated by nicksinger over 1 year ago

I've asked infra to send it back to us: https://sd.suse.com/servicedesk/customer/portal/1/SD-98484

Actions #20

Updated by nicksinger over 1 year ago

Most recent update: Server is packed and awaits shipping. Gerhard from infra talked to facilities and they will pick the machine up and send it back (I think from frankencampus)

Actions #21

Updated by nicksinger over 1 year ago

  • Due date set to 2022-10-07

Got an answer from DELTA. Apparently the CPU was broken and they plan to swap it with a different one. According to the mail from DELTA they plan to ship it back to us (Frankenstraße) beginning next week (CW40).

Actions #22

Updated by favogt over 1 year ago

nicksinger wrote:

Got an answer from DELTA. Apparently the CPU was broken and they plan to swap it with a different one. According to the mail from DELTA they plan to ship it back to us (Frankenstraße) beginning next week (CW40).

Great news, that's surprisingly quick!

Actions #23

Updated by okurz over 1 year ago

  • Due date changed from 2022-10-07 to 2022-10-21
  • Status changed from Feedback to Blocked

Dear Sir or Madam,
your repaired computer was shipped today by freight carrier. Attached you will find your electronic delivery note with the number 21660.
As soon as your shipment has been loaded at the carrier, you will be able to see the delivery status. It is usually visible from 8:00 a.m. on the day after you receive this email.
The delivery time for domestic shipments is usually 1-2 working days.

You can track your shipment directly using the following link:

https://www.ids-logistik.de/de/sendungsverfolgung?tracking=90461_2009043476370005

Actions #24

Updated by mgriessmeier over 1 year ago

openqaworker20 is back in town! (Frankencampus, new qe cold storage area, Riegel 2: the middle one. If you come from the kitchen elevator, it is the shared space to the right.)

Actions #25

Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress

@nicksinger have you organized transport to Maxtorhof? Or any other related tasks? As https://www.ids-logistik.de/de/sendungsverfolgung?tracking=90461_2009043476370005 states "Zugestellt" this shouldn't stay on "Blocked" anymore.

Actions #26

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Workable

Indeed this should not be blocked any longer. I've asked Oliver Fecher in a private Slack message if he can help us bring the machine back to Maxtorhof.

Actions #27

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback
Actions #28

Updated by livdywan over 1 year ago

  • Assignee changed from nicksinger to robert.richardson

Robert volunteered to ask in person, since he's in the office today

Actions #29

Updated by robert.richardson over 1 year ago

I talked to Oliver Fecher earlier. He'll be at Maxtorhof tomorrow anyway and will ask if it's possible to get the openqaworker20 machine moved there. If yes, it will probably be dropped off at the former all hands area for now.

Actions #30

Updated by livdywan over 1 year ago

  • Due date changed from 2022-10-21 to 2022-10-28

I forgot to bump the due date. Let's see where we get by the end of the week

Actions #31

Updated by robert.richardson over 1 year ago

  • Due date deleted (2022-10-28)
  • Status changed from Feedback to Blocked
Actions #32

Updated by okurz over 1 year ago

Also there is https://sd.suse.com/servicedesk/customer/portal/1/SD-98484 where gschlotter posted updates and is planning the remounting.

Actions #33

Updated by nicksinger over 1 year ago

The machine is back and mounted. IPMI can be pinged but login does not work. I suspect DELTA reset the password while testing. I asked Gerhard whether we can manage some kind of recovery in a timely manner. Max will be in Maxtorhof today but can't unmount the machine alone to check the password on the sticker inside, so our best approach would be to boot some recovery medium over USB to reset the PW "locally".

Actions #34

Updated by okurz over 1 year ago

  • Status changed from Blocked to Workable
  • Assignee changed from robert.richardson to nicksinger

Nick and I will try to meet Gerhard or Max and solve that on site at Maxtorhof.

Actions #35

Updated by okurz over 1 year ago

  • Tags set to next-office-day
Actions #36

Updated by okurz over 1 year ago

Last appointment did not work out. Planned for this week.

Actions #37

Updated by nicksinger over 1 year ago

  • Tags deleted (next-office-day)
Actions #38

Updated by livdywan over 1 year ago

  • Subject changed from openqaworker20 fails to boot, broken hardware size:M to openqaworker20 fails to boot, broken hardware

This ticket has no AC. I suggest we estimate it again and fill in the gaps.

Actions #39

Updated by mkittler over 1 year ago

  • Description updated (diff)

I took the liberty of adding acceptance criteria, although this should be quite self-explanatory.

Actions #40

Updated by mkittler over 1 year ago

  • Description updated (diff)

I took the liberty of adding acceptance criteria, although this should be quite self-explanatory.

Actions #41

Updated by favogt over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee changed from nicksinger to favogt
  • % Done changed from 0 to 90

Apparently openqaworker20 was moved to the new security zone already, so I had to request access to qe-jumpy. After that was granted I could set the machine up again. So far a simple (xfce-live) test passed and a local MM test scenario (firewalld-policy-*) also works.

Here are my notes of what I did, basically copy-pasting commands after I ran them. Apparently using code blocks in ordered lists doesn't quite work right here, so feel free to count yourself.

  1. Enabled the tftp + uefi parts in /etc/dnsmasq.d/pxeboot.conf again. I had to make the efi file unconditional, as the tag somehow did not match and it defaulted to grub for ppc64 (visible in dnsmasq's log).
  2. Booted the server into IPv4 PXE mode on the right interface.
  3. Completed the installation over ipmitool SoL, enabled SSH service + opened ports.
  4. Came up after a reboot. Initial setup:
zypper ar -p 95 'http://download.opensuse.org/repositories/devel:openQA/$releasever' devel_openQA
zypper ar -p 90 'http://download.opensuse.org/repositories/devel:openQA:Leap:15.4/15.4' devel_openQA_Leap
zypper -v ref
zypper in bash-completion nano # yes :P
# relogin to activate bash completion
zypper in os-autoinst-distri-opensuse-deps
zypper in os-autoinst-openvswitch
nano /etc/openqa/client.conf # fill with data
nano /etc/openqa/workes.ini # copied from ow19, WORKER_CLASS adjusted
nano /etc/fstab # add /var/lib/openqa/share mount, copied from ow19
zypper in nfs-client
systemctl start var-lib-openqa-share.automount
zypper in openQA-worker # Should've done that earlier, thought it was included already:
# warning: /etc/openqa/client.conf created as /etc/openqa/client.conf.rpmnew
# Detected autofs mount point /var/lib/openqa/share during canonicalization of /var/lib/openqa/share.
# Skipping /var/lib/openqa/share
systemctl enable openqa-worker-auto-restart@{1..20}
mv /etc/openqa/workes.ini /etc/openqa/workers.ini # To fix the earlier typo...
systemctl enable --now openqa-worker-cacheservice openqa-worker-cacheservice-minion
systemctl start openqa-worker-auto-restart@1
systemctl disable --now salt-minion.service # not used yet
# Clone a test job (xfce-live): openqa-clone-job --within-instance https://openqa.opensuse.org 2927811 _GROUP=38 WORKER_CLASS=openqaworker20
zypper in --oldpackage http://download.opensuse.org/update/leap/15.3/sle/noarch/qemu-ovmf-x86_64-202008-150300.10.14.1.noarch.rpm
zypper al -m "poo#116914" qemu-ovmf-x86_64
zypper in ffmpeg-4 # Referenced in workers.ini
firewall-cmd --permanent --new-service isotovideo
for i in {1..50}; do firewall-cmd --permanent --service=isotovideo --add-port=$((i * 10 + 20003))/tcp ; done
firewall-cmd --permanent --zone=public --add-service=isotovideo
firewall-cmd --reload
# Test looks good!
  1. Set up OpenVSwitch with MM networking:
systemctl enable --now openvswitch os-autoinst-openvswitch
# Copy openqa-prepare-mm-setup from ow19
./openqa-prepare-mm-setup
# Note: On ow20 it's eth1, not eth0
cat > /etc/firewalld/zones/trusted.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<zone target="ACCEPT">
  <short>Trusted</short>
  <description>All network connections are accepted.</description>
  <interface name="br1"/>
  <interface name="eth1"/>
  <masquerade/>
</zone>
EOF
firewall-cmd --permanent --zone=public --remove-interface=eth1
firewall-cmd --reload
reboot
  1. Wait until it's rebooted and log in again. The logs show that firewalld broke, because of an already fixed bug, so run zypper dup
  2. Clone a MM test (firewalld-policy-firewall): openqa-clone-job --within-instance https://openqa.opensuse.org --max-depth 0 --skip-chained-deps 2928052 _GROUP=38 WORKER_CLASS=openqaworker20
  3. Notice that it failed because ifcfg-eth1 still had ZONE=public.
nano /etc/sysconfig/network/ifcfg-eth1 /etc/firewalld/zones/public.xml /etc/firewalld/zones/trusted.xml
rcfirewalld restart
  1. Noticed that the wicked script is invalid. Fix openqa-prepare-mm-setup (remove newline before #!/bin/sh, $bridge -> \$bridge) and run it again.
  2. Rebooted and restarted failed test -> green
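
The port arithmetic from the firewall-cmd loop in the notes above can be sanity-checked without touching firewalld. A small sketch, assuming only the formula used in that loop (port = i * 10 + 20003 for worker instance i); the function name isotovideo_port is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the arithmetic of the firewall-cmd loop:
# one TCP port per worker instance.
isotovideo_port() {
    # $1: worker instance number (1-based)
    echo $(( $1 * 10 + 20003 ))
}

# Print the ports the loop would open for the first three instances
for i in 1 2 3; do
    isotovideo_port "$i"
done
```

This prints 20013, 20023 and 20033, so with 50 iterations the loop covers more instances than the 20 worker slots enabled on this machine.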
Actions #42

Updated by openqa_review over 1 year ago

  • Due date set to 2022-12-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #43

Updated by favogt over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

ow20 is enabled for production now, with tap. So far I didn't observe any issues.

I noticed that systemctl enable --now openqa-reload-worker-auto-restart@{1..20}.path was missing, did that as well.

Actions #44

Updated by favogt over 1 year ago

Finally, a failure: The OVMF flavors for staging were missing.

Fixed by doing scp openqaworker19:/usr/share/qemu/*staging* /usr/share/qemu/.

Actions #45

Updated by favogt over 1 year ago

QEMU_ENABLE_SMBD=1 (needed by wsl) needed zypper in samba as well.

Actions #46

Updated by okurz over 1 year ago

  • Status changed from Resolved to Feedback

at least I found "auto-update" missing from systemd timers. We should closely check if there is anything else missing compared to other o3 workers
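
One way such a comparison could be done is to diff the enabled systemd units of a known-good worker against the new one. This is only a sketch: the helper name missing_units and the file names are made up, and it assumes both lists were already saved as plain text files with one unit name per line.

```shell
#!/bin/sh
# Hypothetical helper: print entries that are present in the reference list
# but missing from the candidate list.
# $1: reference file (e.g. units on ow19), $2: candidate file (e.g. units on ow20)
missing_units() {
    ref_sorted=$(mktemp) && cand_sorted=$(mktemp) || return 1
    sort -u "$1" > "$ref_sorted"
    sort -u "$2" > "$cand_sorted"
    # comm -23 keeps only the lines unique to the first file
    comm -23 "$ref_sorted" "$cand_sorted"
    rm -f "$ref_sorted" "$cand_sorted"
}

# Possible usage, assuming root SSH access to both workers (not executed here):
# ssh root@openqaworker19 systemctl list-unit-files --state=enabled --no-legend | awk '{print $1}' > ow19.txt
# ssh root@openqaworker20 systemctl list-unit-files --state=enabled --no-legend | awk '{print $1}' > ow20.txt
# missing_units ow19.txt ow20.txt
```

The same helper would work for package lists, e.g. the output of rpm -qa --qf '%{NAME}\n' from both hosts.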

Actions #47

Updated by favogt over 1 year ago

okurz wrote:

at least I found "auto-update" missing from systemd timers. We should closely check if there is anything else missing compared to other o3 workers

I installed that - did I forget to enable it?

Actions #48

Updated by livdywan over 1 year ago

  • Subject changed from openqaworker20 fails to boot, broken hardware to openqaworker20 fails to boot, broken hardware size:M
  • Description updated (diff)
Actions #49

Updated by okurz over 1 year ago

  • Due date deleted (2022-12-20)
  • Status changed from Feedback to Resolved

favogt wrote:

okurz wrote:

at least I found "auto-update" missing from systemd timers. We should closely check if there is anything else missing compared to other o3 workers

I installed that - did I forget to enable it?

I have not found the package "openQA-auto-update" installed. I installed and enabled it with:

ssh root@openqaworker20 "zypper -n in openQA-auto-update && systemctl enable --now openqa-auto-update.timer"

Rest looks fine. Thanks for the good setup!

Actions #50

Updated by favogt over 1 year ago

I have not found the package "openQA-auto-update" installed.

I probably mixed that up with openQA-continuous-update. Confusing naming...
