action #165611
closed: [openQA][infra][sut][aarch64] Power supply failure on squiddlydiddly arm64 machine size:M
Description
Observation
Nothing is printed on the IPMI SOL console of the squiddlydiddly arm64 machine:
localhost:~ # ipmitool -C3 -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'xxxxxxx' sol activate
[SOL Session operational. Use ~? for help]
Steps to reproduce
- Establish ipmi sol connection to squiddlydiddly
- Perform power reset
- Wait for ipmi sol console output
Impact
As this machine is used through the ipmi backend, having no output on the IPMI SOL console makes it unusable for testing.
Problem
- Possibly a BMC issue? However, resetting the BMC does not help.
Suggestions
- Inspect the BMC in person?
- Consider a firmware upgrade, see #162593
- Power cycle the machine
- Send the machine back, see #165611#note-22
Rollback steps
ssh imagetester.qe.nue2.suse.org "sudo systemctl unmask openqa-{,reload-}worker-auto-restart@16 && sudo systemctl enable --now openqa-{,reload-}worker-auto-restart@16"
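The rollback command relies on brace expansion to address two systemd units at once. Because the expression sits inside double quotes, it is expanded by the remote shell rather than locally. The following minimal sketch only prints what the expression expands to. It assumes that bash is available, since brace expansion is not part of POSIX sh.

```shell
# Show which unit names the brace expression from the rollback step expands to.
# Assumption: bash is installed; plain POSIX sh would print the braces literally.
units=$(bash -c 'echo openqa-{,reload-}worker-auto-restart@16')
echo "$units"
```

If the login shell on the worker host were not bash, the two unit names would have to be spelled out explicitly.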
Workaround
n/a
Updated by rcai 3 months ago
related SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-165090
Updated by mkittler 3 months ago
- Description updated (diff)
- Status changed from New to Resolved
- Assignee set to mkittler
It works from my side. I had to deactivate an existing session before but sol activate says [SOL Session operational. Use ~? for help]. Note that there's currently no prompt showing up but that's merely because the system is powered off (which is supposedly also ok at this point).
Maybe the OS on the machine also needs to be configured to print anything over SOL but I don't think the problem is IPMI/SOL itself.
Please re-open the ticket if you think we can help you further.
Updated by waynechen55 3 months ago
- Status changed from Resolved to New
mkittler wrote in #note-5:
It works from my side. I had to deactivate an existing session before but sol activate says [SOL Session operational. Use ~? for help]. Note that there's currently no prompt showing up but that's merely because the system is powered off (which is supposedly also ok at this point).
Maybe the OS on the machine also needs to be configured to print anything over SOL but I don't think the problem is IPMI/SOL itself.
Please re-open the ticket if you think we can help you further.
This is not the case. After powering it on, nothing is printed on its IPMI SOL console, which is the issue reported in this ticket. Not a single character shows up on the IPMI SOL console.
Updated by rcai 3 months ago · Edited
Hi @mkittler,
Some error statuses:
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power on
Chassis Power Control: Up/On
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
cailf@cailf:~> ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P 'XXX-@3-vt' chassis power status
Chassis Power is off
Still cannot power on this machine.
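The session above shows `chassis power on` being acknowledged while `chassis power status` keeps reporting off. Such a check can be scripted instead of repeated by hand. The following is a minimal sketch: the helper name `wait_for_power_on` and its argument order are hypothetical, and it only assumes that the status command prints `Chassis Power is on` once the machine is up.

```shell
# Hypothetical helper: wait_for_power_on TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, DELAY seconds apart, until its output
# contains "Chassis Power is on". Returns 1 if that never happens.
wait_for_power_on() {
  tries=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" | grep -q 'Chassis Power is on'; then
      echo "on after $i check(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still off after $tries check(s)"
  return 1
}
```

A call could look like `wait_for_power_on 10 5 ipmitool -I lanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U ADMIN -P "$password" chassis power status`, where `$password` is a placeholder for the real credential.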
I checked the log and it shows a PSU failure:
1724313366  Critical  2024-08-22 07:56:06 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313354  Critical  2024-08-22 07:55:54 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313343  Critical  2024-08-22 07:55:43 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313331  Critical  2024-08-22 07:55:31 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313320  Critical  2024-08-22 07:55:20 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313308  Critical  2024-08-22 07:55:08 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313297  Critical  2024-08-22 07:54:57 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313285  Critical  2024-08-22 07:54:45 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313274  Critical  2024-08-22 07:54:34 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313262  Critical  2024-08-22 07:54:22 UTC  Power supply power good failed to assert within 8000 milliseconds.
1724313250  Critical  2024-08-22 07:54:10 UTC  Power supply power good failed to assert within 8000 milliseconds.
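The leading number of each log entry is a Unix epoch timestamp that corresponds to the date and time printed with it. The sketch below converts the four newest values and prints the gap between neighbouring entries, which makes the 11 to 12 second spacing of the failures visible. It assumes GNU `date` for the `-d @EPOCH` syntax; the variable names are illustrative only.

```shell
# Epoch values copied from the four newest log entries (newest first).
stamps="1724313366 1724313354 1724313343 1724313331"

report=""
newer=""
for ts in $stamps; do
  # -d @EPOCH is GNU date syntax (an assumption about the environment).
  when=$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC')
  if [ -n "$newer" ]; then
    # Seconds between this entry and the next newer one.
    when="$when ($((newer - ts)) s before the next entry)"
  fi
  report="$report$when
"
  newer=$ts
done
printf '%s' "$report"
```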
Updated by mkittler 3 months ago
- Status changed from In Progress to Feedback
I pulled the plug of the machine on http://epdu-b4.qe.nue2.suse.org and plugged it back after 30 minutes. After this I can power on the machine via ipmi and also get a prompt via sol. The SLE 15 SP6 system on the machine booted fine.
Updated by waynechen55 3 months ago
mkittler wrote in #note-9:
I pulled the plug of the machine on http://epdu-b4.qe.nue2.suse.org and plugged it back after 30 minutes. After this I can power on the machine via ipmi and also get a prompt via sol. The SLE 15 SP6 system on the machine booted fine.
Thanks. However, I am not sure whether the power supply is stable enough. I think it would be better to keep an eye on it and make sure the power supply is good enough for the machine to stay up and running in the long term. A broken power supply is critical, I think, and can damage the machine unexpectedly.
Updated by mkittler 3 months ago · Edited
Not sure how to diagnose the PSU¹. For now I'd just assume it was stuck in a problematic state but resetting it in the way I did helped. If we see any problems with it again in the future we can consider replacing the PSU.
¹I can only tell that there are no further PSU-related entries in the logs on https://squiddlydiddly-sp.qe.nue2.suse.org after the re-plugging.
Updated by waynechen55 3 months ago
Let me reiterate the issue and impact:
The arm64 machine squiddlydiddly cannot be powered on from time to time, and the recovery measures do not solve the problem at its root.
The PSU keeps reporting the following error each time the issue occurs:
Power supply power good failed to assert within 8000 milliseconds. 1724313354 Critical
It seems there is something wrong with the PSU.
@mkittler also thinks replacing the PSU is necessary. Please refer to https://progress.opensuse.org/issues/165611#note-11
Maybe someone can help diagnose the PSU before replacing it. Can we take this one step further and start solving the problem? @mkittler
Updated by mkittler 3 months ago
I think we need to file an SD ticket after all, see https://sd.suse.com/servicedesk/customer/portal/1/SD-167212.
Updated by mkittler 3 months ago
IT cannot help us with this machine but granted us access to https://sd.suse.com/servicedesk/customer/portal/1/SD-133647. The relevant order can also be found on https://racktables.suse.de/index.php?page=file&file_id=5346.
So I sent a mail to Delta Computers (and a copy to OSD admins) asking for support.
Updated by mkittler 3 months ago · Edited
- Tags changed from infra to infra, next-frankencampus-visit, next-office-day
- Status changed from Feedback to Workable
- Assignee deleted (mkittler)
Unless someone wants to test the machine with a regular ATX power supply, as suggested by the Delta Computers support, we need to send the machine back to Delta Computers so they can investigate the issue further. According to their response the PDB is probably broken.
Check out https://mailman.suse.de/mlarch/SuSE/osd-admins/2024/osd-admins.2024.09/msg00058.html and follow-ups for my conversation with the Delta Computers support. This answer also contains the address we need to send the machine back to.
So the next steps for this ticket would be:
- (optional) Cross-check with a regular ATX power supply. (Note that the re-plugging will fix the problem temporarily even with the probably broken PDB. So to cross-check this in a meaningful way we would need to keep the ATX power supply in use for a couple of days. This is probably not worth it.)
- Check for further replies from Delta Computers on the osd-admins mailing list.
- Check the workflow for sending the machine back, e.g. where to get a box from (as the box the server came with was most likely discarded) and where to place the box.
- Inform Delta Computers that we have sent the machine. So far I only replied that we will probably send the machine to them at some point.
I am not sure when I'll be in the office and I don't have a card to access the office itself. So I am unassigning and adding the appropriate tags.
Updated by okurz 3 months ago
- Related to action #153111: [openQA][console][ipmi][sol] xterm process quits and ipmi sol console crashes added
Updated by okurz 3 months ago
- Status changed from Workable to In Progress
- Assignee set to okurz
I plan to look into the machine both remotely and on site. I initially planned to look into the machine already yesterday but due to higher priority discussions regarding #165282 I was not able to do so yet. I can only recommend that everyone treat this machine as a non-critical testing machine. It must not be relied upon for critical validation work.
Updated by okurz 3 months ago
I checked the machine on site and found that the server was actually already powered on even though https://squiddlydiddly-sp.qe.nue2.suse.org/#/ and IPMI claimed it was powered off. I forced it off with a >4s press on the front power button. I then powered the machine on over the BMC. The PSU status on https://squiddlydiddly-sp.qe.nue2.suse.org/#/system/component-info/psu was reporting data, but the AC input power showed 0 W for both PSUs while voltage*current computed to a rather low consumption of 30 W + 30 W. The web BMC KVM as well as IPMI SoL didn't show anything.
I unplugged both power plugs, also removed the PSUs temporarily from the case and ensured proper seating. Replugging immediately powered on the machine. After around 1m I could access the BMC again and after another 2-5m the Supermicro startup logo showed up. After another 2-5m the currently configured Linux SLE15-SP7 test snapshot booted up. The PSU status also shows a reasonable consumption now.
I then took a better measurement of how long it takes until certain reactions show up, counted from power on:
- 10s: status messages show up on IPMI SoL.
- 90s: the Supermicro logo shows up on a locally connected display and the remote KVM.
- 2m30s: iPXE starts.
- 3m0s: the iPXE handling is finished with the message "EFI stub: Exiting boot services..." and the Linux kernel output on IPMI SoL starts.
- 4m40s: the installer system responds to ping over the network.
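The last of these measurements, the time from power on until the system answers ping, can be reproduced with a small script. The following is a minimal sketch, not the method used for the numbers above: the helper name `seconds_until` is hypothetical and it simply retries a command once per second until it succeeds or a timeout is reached.

```shell
# Hypothetical helper: seconds_until TIMEOUT CMD [ARGS...]
# Retries CMD once per second and prints the elapsed whole seconds
# once it succeeds. Returns 1 if TIMEOUT seconds pass first.
seconds_until() {
  timeout_s=$1
  shift
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout_s" ]; then
      echo "timeout after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
  done
  echo $(( $(date +%s) - start ))
}
```

Started right after the IPMI power on command, `seconds_until 600 ping -c1 squiddlydiddly.qe.nue2.suse.org` would be expected to print a value in the region of the 4m40s noted above, depending on what is being booted.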
Updated by okurz 3 months ago · Edited
- Description updated (diff)
- Due date set to 2024-09-30
I realized the worker instance was never disabled in production. I did that now and added the corresponding rollback step. I also started a stability experiment at 2024-09-16 10:00Z:
for i in {1..1000}; do
  echo "### run $i/1000 at $(date -Is), ipmi power on .. " &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U $user -P $password chassis bootdev pxe &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U $user -P $password power on &&
  echo -n "ok, ping .. " &&
  timeout -k 5 600 sh -c "until ping -c30 squiddlydiddly.qe.nue2.suse.org >/dev/null; do :; done" &&
  echo "ok, ipmi power off .. " &&
  sudo ipmitool -Ilanplus -H squiddlydiddly-sp.qe.nue2.suse.org -U $user -P $password power off &&
  echo -n "ok, sleeping .." &&
  sleep 120 &&
  echo "ok, cycling"
done
Updated by okurz 2 months ago · Edited
- Due date deleted (2024-09-30)
The above experiment finished with
### run 1000/1000 at 2024-09-20T02:28:16+00:00, ipmi power on ..
Set Boot Device to pxe
Chassis Power Control: Up/On
ok, ping .. ok, ipmi power off ..
Chassis Power Control: Down/Off
ok, sleeping ..ok, cycling
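Because the steps inside the experiment loop are chained with `&&`, a failing step only ends the current iteration and the loop then continues with the next run. Reaching run 1000 therefore does not by itself show that every run completed. If the loop output was saved to a file, which is an assumption here, the started and completed runs can be compared with a sketch like the following. The helper name `summarize_runs` is hypothetical.

```shell
# Hypothetical helper: summarize_runs LOGFILE
# Counts iterations that started ("### run ...") and iterations that
# reached the final "ok, cycling" message of the loop.
summarize_runs() {
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  started=$(grep -c '^### run ' "$1" || true)
  completed=$(grep -c 'ok, cycling$' "$1" || true)
  echo "started=$started completed=$completed"
}
```

Equal counts would support the conclusion that all power cycles succeeded; a lower `completed` count would point to runs where a step failed.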
For now I consider the machine stable. My hypotheses regarding why the initial problem happened are:
- we have a sporadic problem which would take a longer time to reproduce, or
- we had a connectivity problem within the PSU connection, fixed by me unplugging and replugging the PSUs, or
- the BMC and firmware upgrade caused an inconsistent system state, solved by me completely unplugging and replugging the system.
I triggered https://openqa.suse.de/tests/15487266 for verification. Failed in https://openqa.suse.de/tests/15487266#step/handle_reboot/3 as the machine tries to boot the installer again instead of the installed system. I now set in https://squiddlydiddly-sp.qe.nue2.suse.org/#/operations/server-power-operations to boot from HDD by default. Let's see if that helps. Otherwise I guess I need to go into the BIOS and change the boot order.
Updated by okurz 2 months ago
- Tags changed from infra, next-frankencampus-visit, next-office-day to infra
- Due date set to 2024-09-27
- Status changed from Feedback to Resolved
I verified that the system works fine as expected when booting it manually. The installations visible in the openQA jobs finish correctly, though the machine is either never triggered to reboot or gets stuck in that state. However, when booting from the local storage device I see a fully usable SLE installation. Regardless, the original problem regarding the power supply failure seems to be fixed by now. As no last good was stated, as https://openqa.suse.de/admin/workers/3206 contains no ok job in the history and as https://openqa.suse.de/tests/15488395#step/handle_reboot/4 is as far as openQA tests reach, I consider the original problem resolved. The rest is out of scope.