action #165611

closed

[openQA][infra][sut][aarch64] Power supply failure on squiddlydiddly arm64 machine size:M

Added by waynechen55 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-08-22
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Nothing is printed on the IPMI SOL console of the squiddlydiddly arm64 machine:

localhost:~ # ipmitool -C3 -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'xxxxxxx' sol activate
[SOL Session operational.  Use ~? for help]

Steps to reproduce

  • Establish ipmi sol connection to squiddlydiddly
  • Perform power reset
  • Wait for ipmi sol console output

Impact

This machine is used through the IPMI backend, so without output on the IPMI SOL console it cannot be used for testing at all.

Problem

  • Possibly a BMC issue, but resetting the BMC does not help

Suggestions

  • Possibly inspect the BMC in person
  • Consider a firmware upgrade, see #162593
  • Power cycle the machine
  • Send the machine back, see #165611#note-22

Rollback steps

  • ssh imagetester.qe.nue2.suse.org "sudo systemctl unmask openqa-{,reload-}worker-auto-restart@16 && sudo systemctl enable --now openqa-{,reload-}worker-auto-restart@16"
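The brace pattern in the rollback step relies on bash brace expansion and refers to two systemd units. As a minimal sketch (the helper name `worker_units` is hypothetical and not part of openQA), the same two unit names can be listed in plain POSIX sh:

```shell
#!/bin/sh
# Sketch only: list the two systemd units that the bash brace pattern
# openqa-{,reload-}worker-auto-restart@16 expands to.
# worker_units is a hypothetical helper, not part of openQA.
worker_units() {
    instance=$1
    for prefix in "" "reload-"; do
        echo "openqa-${prefix}worker-auto-restart@${instance}"
    done
}
```

Calling `worker_units 16` prints `openqa-worker-auto-restart@16` and `openqa-reload-worker-auto-restart@16`, one per line, which are the units the rollback step unmasks and enables.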

Workaround

n/a


Files

clipboard-202408291508-a2azx.png (22.1 KB) PSU failure rcai, 2024-08-29 07:08

Related issues 1 (1 open, 0 closed)

Related to openQA Project - action #153111: [openQA][console][ipmi][sol] xterm process quits and ipmi sol console crashes | New | 2024-01-04

Actions #1

Updated by rcai 3 months ago

1724313469   Critical   2024-08-22 07:57:49 UTC   Power supply power good failed to assert within 8000 milliseconds.

It is still a PSU issue.
Last time, the issue was fixed temporarily by unplugging and re-plugging the PSU.

Actions #3

Updated by tinita 3 months ago

  • Target version set to Ready
Actions #4

Updated by tinita 3 months ago

  • Tags set to infra
  • Category set to Regressions/Crashes
Actions #5

Updated by mkittler 3 months ago

  • Description updated (diff)
  • Status changed from New to Resolved
  • Assignee set to mkittler

It works from my side. I had to deactivate an existing session before but sol activate says [SOL Session operational. Use ~? for help]. Note that there's currently no prompt showing up but that's merely because the system is powered off (which is supposedly also ok at this point).
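The "deactivate an existing session" step mentioned above could look roughly like the following sketch. The wrapper `ipmi`, the helper `reopen_sol` and the environment variables `BMC_HOST`, `BMC_USER` and `BMC_PASSWORD` are assumptions made for illustration; only the `sol deactivate` and `sol activate` subcommands come from ipmitool itself.

```shell
#!/bin/sh
# Hypothetical wrapper; BMC_HOST, BMC_USER and BMC_PASSWORD are assumed
# environment variables, not something the ticket defines.
ipmi() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASSWORD" "$@"
}

# Drop a possibly stale SOL session, then open a fresh one.
# A failing deactivate (e.g. no session was active) is ignored on purpose.
reopen_sol() {
    ipmi sol deactivate || true
    ipmi sol activate
}
```

Note that an operational SOL session alone says nothing about the power state, which is why the later notes check `chassis power status` separately.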

Maybe the OS on the machine also needs to be configured to print anything over SOL but I don't think the problem is IPMI/SOL itself.

Please re-open the ticket if you think we can help you further.

Actions #6

Updated by waynechen55 3 months ago

  • Status changed from Resolved to New

mkittler wrote in #note-5:

It works from my side. I had to deactivate an existing session before but sol activate says [SOL Session operational. Use ~? for help]. Note that there's currently no prompt showing up but that's merely because the system is powered off (which is supposedly also ok at this point).

Maybe the OS on the machine also needs to be configured to print anything over SOL but I don't think the problem is IPMI/SOL itself.

Please re-open the ticket if you think we can help you further.

This is not the case. After powering it on, there is nothing printed out on its ipmi sol console which is the issue reported in this ticket. You can not see a single character on ipmi sol console.

Actions #7

Updated by rcai 3 months ago · Edited

Hi @mkittler,

Some error statuses:

cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power on
Chassis Power Control: Up/On
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off

Still cannot power on this machine.
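The manual sequence above (one `chassis power on` followed by repeated `chassis power status` calls) can be sketched as a small polling helper. This is only an illustration: the function names and the `BMC_HOST`, `BMC_USER` and `BMC_PASSWORD` variables are hypothetical, and the match assumes the status line reads "Chassis Power is on" in the good case, mirroring the "Chassis Power is off" output shown above.

```shell
#!/bin/sh
# Hypothetical helper names; BMC_HOST, BMC_USER and BMC_PASSWORD are
# assumed environment variables, not something the ticket defines.
power_status() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASSWORD" \
        chassis power status
}

# wait_for_power_on ATTEMPTS DELAY_SECONDS
# Returns 0 once the status line ends in "is on", 1 if it never does.
wait_for_power_on() {
    attempts=$1
    delay=$2
    i=1
    while [ "$i" -le "$attempts" ]; do
        case "$(power_status 2>/dev/null)" in
            *"is on") return 0 ;;
        esac
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for_power_on 5 10` would check five times, ten seconds apart, and return non-zero in the situation shown here, where the chassis stays off.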

The log shows repeated PSU failures:

1724313366   Critical   2024-08-22 07:56:06 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313354   Critical   2024-08-22 07:55:54 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313343   Critical   2024-08-22 07:55:43 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313331   Critical   2024-08-22 07:55:31 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313320   Critical   2024-08-22 07:55:20 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313308   Critical   2024-08-22 07:55:08 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313297   Critical   2024-08-22 07:54:57 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313285   Critical   2024-08-22 07:54:45 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313274   Critical   2024-08-22 07:54:34 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313262   Critical   2024-08-22 07:54:22 UTC   Power supply power good failed to assert within 8000 milliseconds.
1724313250   Critical   2024-08-22 07:54:10 UTC   Power supply power good failed to assert within 8000 milliseconds.
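The first column of each log entry is the event time in epoch seconds, and consecutive entries above are 11 to 12 seconds apart. A minimal sketch for checking this, assuming GNU `date` (the `-d @EPOCH` form is not portable to BSD `date`) and two hypothetical helper names:

```shell
#!/bin/sh
# epoch_to_utc EPOCH: print the timestamp in the layout used by the log.
# Assumes GNU date.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# retry_gaps: read epoch seconds from stdin, newest first, one per line,
# and print the gap in seconds between each pair of consecutive entries.
retry_gaps() {
    awk 'NR > 1 { print prev - $1 } { prev = $1 }'
}
```

For instance, `epoch_to_utc 1724313366` prints `2024-08-22 07:56:06 UTC`, and feeding the first three epochs of the log to `retry_gaps` prints `12` and `11`.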
Actions #8

Updated by mkittler 3 months ago

  • Status changed from New to In Progress

I'll see what I can do to recover that machine.

Actions #9

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

I pulled the plug of the machine on http://epdu-b4.qe.nue2.suse.org and plugged it back in after 30 minutes. After this I can power on the machine via IPMI and also get a prompt via SOL. The SLE 15 SP6 system on the machine booted fine.

Actions #10

Updated by waynechen55 3 months ago

mkittler wrote in #note-9:

I pulled the plug of the machine on http://epdu-b4.qe.nue2.suse.org and plugged it back in after 30 minutes. After this I can power on the machine via IPMI and also get a prompt via SOL. The SLE 15 SP6 system on the machine booted fine.

Thanks. But I am not sure whether the power supply is stable enough. I think it would be better to keep an eye on it and make sure the power supply is good enough for the machine to stay up and running in the long term. A broken power supply is critical, I think, and can damage the machine unexpectedly.

Actions #11

Updated by mkittler 3 months ago · Edited

Not sure how to diagnose the PSU¹. For now I'd just assume it was stuck in a problematic state and that resetting it the way I did helped. If we see any problems with it again in the future we can consider replacing the PSU.


¹I can only tell that there are no further PSU-related entries in the logs on https://squiddlydiddly-sp.qe.nue2.suse.org after the re-plugging.

Actions #12

Updated by rcai 3 months ago · Edited

The machine cannot be powered on again.
We hit the same issue, and this time it cannot be fixed by unplugging and re-plugging the PSU on site.
It may be necessary to contact a SUPERMICRO engineer in Europe for on-site repair if the machine is still within warranty.

Actions #14

Updated by waynechen55 3 months ago

Is it time to consider replacing the PSU? @mkittler

Actions #15

Updated by rcai 3 months ago

Is there any update on this? Many thanks. This machine is very important for testing the aarch64 platform.

Actions #16

Updated by waynechen55 3 months ago

@mkittler @okurz

Let me reiterate the issue and impact:

  1. arm64 machine squiddlydiddly cannot be powered on from time to time, and the recovery measures do not fix the problem permanently.

  2. The PSU keeps reporting the following error each time the issue occurs:

    1724313354   Critical   Power supply power good failed to assert within 8000 milliseconds.

    It seems there is something wrong with the PSU.

  3. @mkittler also thinks replacing the PSU is necessary. Please refer to https://progress.opensuse.org/issues/165611#note-11

  4. Maybe someone can help diagnose the PSU before replacing it. Can we take this step to start solving the problem? @mkittler

Actions #17

Updated by mkittler 3 months ago

  • Subject changed from [openQA][infra][sut][aarch64][ipmi][sol] No output on ipmi sol console of squiddlydiddly arm64 machine to [openQA][infra][sut][aarch64] Power supply failure on squiddlydiddly arm64 machine
Actions #18

Updated by mkittler 3 months ago

I think we need to file an SD ticket after all, see https://sd.suse.com/servicedesk/customer/portal/1/SD-167212.

Actions #19

Updated by mkittler 3 months ago

IT cannot help us with this machine but granted us access to https://sd.suse.com/servicedesk/customer/portal/1/SD-133647. The relevant order can also be found on https://racktables.suse.de/index.php?page=file&file_id=5346.

So I sent a mail to Delta Computers (and a copy to OSD admins) asking for support.

Actions #21

Updated by waynechen55 3 months ago

Thanks @mkittler. We have made some progress.

Actions #22

Updated by mkittler 3 months ago · Edited

  • Tags changed from infra to infra, next-frankencampus-visit, next-office-day
  • Status changed from Feedback to Workable
  • Assignee deleted (mkittler)

Unless someone wants to test the machine with a regular ATX power supply, as suggested by the Delta Computers support, we need to send the machine back to Delta Computers so they can investigate the issue further. According to their response the PDB is probably broken.

Check out https://mailman.suse.de/mlarch/SuSE/osd-admins/2024/osd-admins.2024.09/msg00058.html and follow-ups for my conversation with the Delta support. Their answer also contains the address we need to send the machine back to.

So the next steps for this ticket would be:

  1. (optional) Cross-check with a regular ATX power supply. (Note that re-plugging will temporarily fix the problem even with the probably broken PDB. So to cross-check this in a meaningful way we would need to keep the ATX power supply in use for a couple of days. This is probably not worth it.)
  2. Check for further replies from Delta Computers on the osd-admins mailing list.
  3. Check the workflow for this, e.g. where to get a box from (as the box the server came with was most likely discarded) and where to place the box.
  4. Inform Delta Computers that we have sent the machine. So far I only replied that we will probably send the machine to them at some point.

I am not sure when I'll be in the office and I don't have a card to access the office itself. So I am unassigning and adding the appropriate tags.

Actions #23

Updated by okurz 3 months ago

  • Related to action #153111: [openQA][console][ipmi][sol] xterm process quits and ipmi sol console crashes added
Actions #24

Updated by okurz 3 months ago

  • Status changed from Workable to New
Actions #25

Updated by livdywan 3 months ago

  • Subject changed from [openQA][infra][sut][aarch64] Power supply failure on squiddlydiddly arm64 machine to [openQA][infra][sut][aarch64] Power supply failure on squiddlydiddly arm64 machine size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #26

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

I plan to look into the machine both remotely as well as in-place. I initially planned to look into the machine already yesterday but due to higher priority discussions regarding #165282 I was not able to do so yet. I can only recommend everyone to treat this machine as a non-critical testing machine. It must not be relied upon for critical validation work.

Actions #27

Updated by okurz 3 months ago

  • Status changed from In Progress to Workable

No success. Need to check the machine in place

Actions #28

Updated by okurz 3 months ago

I checked the machine on site and found that the server was actually already powered on, even though https://squiddlydiddly-sp.qe.nue2.suse.org/#/ and IPMI claimed it was powered off. I forced it off with a >4s press on the front power button. I then powered the machine on over the BMC. The PSU status on https://squiddlydiddly-sp.qe.nue2.suse.org/#/system/component-info/psu was reporting data, but the AC input power showed 0 W for both PSUs, while voltage*current computed to a rather low consumption of 30W+30W. Neither the web BMC KVM nor IPMI SoL showed anything.

I unplugged both power plugs, temporarily removed the PSUs from the case and ensured proper seating. Replugging immediately powered on the machine. After around 1m I could access the BMC again and after another 2-5m the Supermicro startup logo showed up. After another 2-5m the currently configured Linux SLE15-SP7 test snapshot booted up. The PSU status also shows a reasonable consumption now.

I then took a better measurement of how long each stage takes after power on:

  • 10s: status messages show up on IPMI SoL.
  • 90s: the Supermicro logo shows up on a locally connected display and on the remote KVM.
  • 2m30s: iPXE starts.
  • 3m0s: the iPXE handling finishes with the message "EFI stub: Exiting boot services..." and the Linux kernel output on IPMI SoL starts.
  • 4m40s: the installer system responds to ping over the network.
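A timing such as the "responds to ping" milestone above could be taken with a small helper like the following sketch. The name `seconds_until` is hypothetical, and the helper polls once per second, so the result is only accurate to about a second.

```shell
#!/bin/sh
# seconds_until TIMEOUT CMD...
# Re-run CMD once per second until it succeeds, then print the elapsed
# seconds. Returns 1 without printing if TIMEOUT seconds pass first.
seconds_until() {
    timeout_s=$1
    shift
    start=$(date +%s)
    while ! "$@" >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}
```

Started right after the power-on command, a call such as `seconds_until 600 ping -c1 squiddlydiddly.qe.nue2.suse.org` would report roughly how long the system takes to answer on the network.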

Actions #29

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
Actions #30

Updated by okurz 3 months ago · Edited

  • Description updated (diff)
  • Due date set to 2024-09-30

I realized the worker instance was never disabled from production. I did that now and added the corresponding rollback step. A stability experiment has now been running since 2024-09-16 10:00Z:

for i in {1..1000}; do
  echo "### run $i/1000 at $(date -Is), ipmi power on .. " &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U "$user" -P "$password" chassis bootdev pxe &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U "$user" -P "$password" power on &&
  echo -n "ok, ping .. " &&
  timeout -k 5 600 sh -c "until ping -c30 squiddlydiddly.qe.nue2.suse.org >/dev/null; do :; done" &&
  echo "ok, ipmi power off .. " &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U "$user" -P "$password" power off &&
  echo -n "ok, sleeping .." &&
  sleep 120 &&
  echo "ok, cycling"
done
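Because every step in the loop above is chained with `&&`, a run in which any step fails never reaches the final "ok, cycling" message, and the loop simply moves on to the next run. Assuming the loop output was saved to a file, a hypothetical helper like the following sketch could count started and completed runs:

```shell
#!/bin/sh
# summarize_runs [FILE...]
# Count "### run" headers and "ok, cycling" endings in the saved loop
# output (or stdin) and report how many runs did not complete.
summarize_runs() {
    awk '
        /^### run /   { started++ }
        /ok, cycling/ { completed++ }
        END {
            printf "started=%d completed=%d failed=%d\n",
                   started, completed, started - completed
        }
    ' "$@"
}
```

A run that is still in progress when the output is inspected would also be counted as not completed, so the summary is most meaningful once the loop has finished.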
Actions #31

Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback

the test continues to run for now. I will monitor the stability.

Actions #32

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #33

Updated by okurz 2 months ago · Edited

  • Due date deleted (2024-09-30)

The above experiment finished with

### run 1000/1000 at 2024-09-20T02:28:16+00:00, ipmi power on ..
Set Boot Device to pxe
Chassis Power Control: Up/On
ok, ping .. ok, ipmi power off ..
Chassis Power Control: Down/Off
ok, sleeping ..ok, cycling 

For now I consider the machine stable. My hypotheses regarding why the initial problem happened are:

  1. We have a sporadic problem which would take a longer time to reproduce.
  2. We had a connectivity problem within the PSU connection, fixed by me unplugging and replugging the PSUs.
  3. The BMC and firmware upgrade caused an inconsistent system state, solved by me completely unplugging and replugging the system.

I triggered https://openqa.suse.de/tests/15487266 for verification. Failed in https://openqa.suse.de/tests/15487266#step/handle_reboot/3 as the machine tries to boot the installer again instead of the installed system. I now set in https://squiddlydiddly-sp.qe.nue2.suse.org/#/operations/server-power-operations to boot from HDD by default. Let's see if that helps. Otherwise I guess I need to go into the BIOS and change the boot order.

Actions #34

Updated by okurz 2 months ago

  • Tags changed from infra, next-frankencampus-visit, next-office-day to infra
  • Due date set to 2024-09-27
  • Status changed from Feedback to Resolved

I verified that the system works as expected when boots are triggered manually. The installations visible in the openQA jobs finish correctly, but the machine is either never triggered to reboot or is stuck in that state. However, when booting from the local storage device I see a fully usable SLE installation. Regardless, the original problem regarding the power supply failure seems to be fixed by now. As no last good job was stated, as https://openqa.suse.de/admin/workers/3206 contains no ok job in its history, and as https://openqa.suse.de/tests/15488395#step/handle_reboot/4 is as far as openQA tests reach, I consider the original problem resolved. The rest is out of scope.

Actions #35

Updated by xlai 2 months ago

@okurz Thanks for the effort on this.

@rcai Hi Roy, when you have time (after the urgent work is done), could you run some automation jobs to see whether the issue is gone? No hurry.

Actions #36

Updated by xlai 2 months ago

  • Status changed from Resolved to Feedback
Actions #37

Updated by okurz 2 months ago

  • Due date changed from 2024-09-27 to 2024-10-18
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next

waiting for further feedback from rcai

Actions #38

Updated by rcai 2 months ago

I tested it and the iPXE installation works on squiddlydiddly.
The PSU failure did not occur again.

Actions #39

Updated by okurz 2 months ago

  • Due date deleted (2024-10-18)
  • Status changed from Feedback to Resolved
  • Priority changed from Low to Normal
  • Target version changed from Tools - Next to Ready

Alright. Assuming fixed then
