action #174448
closed
bare-metal5 and bare-metal6 fail to boot from PXE most times
100%
Description
Brought up in #174352#note-8
Worker slot in question: https://openqa.suse.de/admin/workers/3992
https://openqa.suse.de/tests/16190397 is a good example.
[2024-12-13T10:57:41.004199Z] [debug] [pid:38779] setting iPXE bootscript on http://baremetal-support.qe.prg2.suse.org for 10.146.4.107 to:
#!ipxe
echo ++++++++++++++++++++++++++++++++++++++++++
echo ++++++++++++ openQA ipxe boot ++++++++++++
echo + Host: bare-metal5.qe.prg2.suse.org
echo ++++++++++++++++++++++++++++++++++++++++++
kernel http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1/boot/x86_64/loader/linux install=http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1 root=/dev/ram0 initrd=initrd textmode=1 autoyast=http://worker36.oqa.prg2.suse.org:20623/PnzpD17aUM9vS2E7/files/bare-metal5.qe.prg2.suse.orgvirt_autotest/host_unattended_installation_files/autoyast/dev_host_15.xml sshd=1 sshpassword=nots3cr3t plymouth.enable=0 video=1024x768 vt.color=0x07 console=ttyS1,115200 Y2DEBUG=1 linuxrc.log=/dev/ttyS1 linuxrc.core=/dev/ttyS1 linuxrc.debug=4,trace reboot_timeout=0
initrd http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1/boot/x86_64/loader/initrd
boot
[2024-12-13T10:57:41.008929Z] [debug] [pid:38779] 200 OK
[2024-12-13T10:57:41.009014Z] [debug] [pid:38779] setting boot device to pxe
[2024-12-13T10:57:41.068997Z] [debug] [pid:38779] IPMI: Set Boot Device to pxe
[2024-12-13T10:57:44.131260Z] [debug] [pid:38779] IPMI: Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: a004000000
Boot Flags :
- Boot Flag Valid
- Options apply to only next boot
- BIOS EFI boot
- Boot Device Selector : Force PXE
- BIOS verbosity : System Default
- Console Redirection control : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
[2024-12-13T10:57:44.189398Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:47.248980Z] [debug] [pid:38779] IPMI: Chassis Power is off
[2024-12-13T10:57:47.309820Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:50.368759Z] [debug] [pid:38779] IPMI: Chassis Power is off
[2024-12-13T10:57:50.461045Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:53.516036Z] [debug] [pid:38779] IPMI: Chassis Power is on
Frame by frame analysis of the video doesn't indicate any attempt to perform a PXE boot.
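For reference, the raw "Boot parameter data: a004000000" value in the log can be decoded by hand to confirm that the BMC really has the PXE override stored. The following is a minimal sketch, assuming the bit layout of the IPMI 2.0 boot flags parameter (parameter 5); the function name decode_boot_flags and the selector table are illustrative and cover only the common selectors.

```python
# Sketch: decode the first two data bytes of IPMI boot parameter 5 (boot flags).
# Assumed layout (IPMI 2.0 boot flags): byte 1 bit 7 = flags valid,
# bit 6 = persistent, bit 5 = EFI boot; byte 2 bits 5..2 = device selector.

DEVICE_SELECTORS = {
    0b0000: "No override",
    0b0001: "Force PXE",
    0b0010: "Force boot from default hard drive",
    0b0101: "Force boot from default CD/DVD",
    0b0110: "Force boot into BIOS setup",
}


def decode_boot_flags(hex_data: str) -> dict:
    """Decode a hex string such as 'a004000000' as printed by ipmitool."""
    data = bytes.fromhex(hex_data)
    if len(data) < 2:
        raise ValueError("need at least two data bytes")
    selector = (data[1] >> 2) & 0x0F
    return {
        "valid": bool(data[0] & 0x80),
        "persistent": bool(data[0] & 0x40),
        "efi": bool(data[0] & 0x20),
        "device": DEVICE_SELECTORS.get(selector, f"selector {selector:#06b}"),
    }


print(decode_boot_flags("a004000000"))
# {'valid': True, 'persistent': False, 'efi': True, 'device': 'Force PXE'}
```

This matches the decoded flags printed in the log (valid, next boot only, EFI, Force PXE), so the override is stored correctly on the BMC side and the firmware appears to ignore it.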
Files
Updated by dheidler 3 months ago
- Related to action #174352: 2 ipmi backend baremetal machines in OSD worker pool are offline size:S added
Updated by dheidler 3 months ago
- File frame0114.png added
- File frame0115.png added
- File frame0148.png added
- Description updated (diff)
Updated by dheidler 3 months ago · Edited
- Status changed from New to In Progress
- Assignee set to dheidler
- Took bare-metal5 out of production
- PXE or BIOS bootdev selection via ipmitool for the next boot is NOT followed (at least when using
chassis bootdev pxe
chassis power off
chassis power on
despite chassis bootparam get 5 showing otherwise)
- disabled quiet boot via BIOS setup
- still no change
- deleted the SLES boot entry and put network boot first in the boot order
- reenabled
- let's see https://openqa.suse.de/tests/16203423#live
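The manual reproduction above can be scripted so that it is easy to repeat. This is only a sketch: the ipmitool subcommands are the ones listed in the note, while the lanplus interface, the helper names and the placeholder credentials are assumptions.

```python
import subprocess

# Subcommands taken from the note above; bootparam get 5 is read back
# before the power cycle to confirm what the BMC has stored.
STEPS = [
    ("chassis", "bootdev", "pxe"),
    ("chassis", "bootparam", "get", "5"),
    ("chassis", "power", "off"),
    ("chassis", "power", "on"),
]


def ipmi_cmd(host, user, password, *args):
    """Build one ipmitool command line (lanplus interface is an assumption)."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *args]


def pxe_power_cycle_commands(host, user, password):
    """Return the full command list for one PXE boot attempt."""
    return [ipmi_cmd(host, user, password, *step) for step in STEPS]


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run: show only the subcommands, without credentials.
for step in STEPS:
    print("ipmitool", *step)
```

Calling run_all(pxe_power_cycle_commands(host, user, password)) would execute the sequence against a real BMC; the dry run above only prints it.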
Updated by xlai 3 months ago
Julie_CAO wrote in #note-5:
Hi @xlai, is there a USB stick in this machine? Or do you know of any change made to this machine recently? They used to run tests well but have now begun to fail to boot.
No, the USB sticks are in bare-metal{1,2}, and as far as I know nothing has changed on these two machines.
Updated by Julie_CAO 3 months ago
dheidler wrote in #note-6:
- Took bare-metal5 out of production
- PXE or BIOS bootdev selection via ipmitool for the next boot is NOT followed (at least when using
chassis bootdev pxe
chassis power off
chassis power on
despite chassis bootparam get 5 showing otherwise)
- disabled quiet boot via BIOS setup
- still no change
- deleted the SLES boot entry and put network boot first in the boot order
- reenabled
- let's see https://openqa.suse.de/tests/16203423#live
Thanks for these actions, but it still failed to boot in https://openqa.suse.de/tests/16203423
Updated by openqa_review 3 months ago
- Due date set to 2024-12-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 3 months ago
redfishtool -r bare-metal5-ipmi.qe.prg2.suse.org -u ***** -p ***** Systems get | jq .Boot
{
  "BootSourceOverrideEnabled": "Once",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Pxe",
  "BootSourceOverrideTarget@Redfish.AllowableValues": [
    "None",
    "Pxe",
    "Floppy",
    "Cd",
    "Usb",
    "Hdd",
    "BiosSetup",
    "UsbCd",
    "UefiBootNext",
    "UefiHttp"
  ],
  "BootOptions": {
    "@odata.id": "/redfish/v1/Systems/1/BootOptions"
  },
  "BootNext": null,
  "BootOrder": [
    "Boot0000",
    "Boot0003",
    "Boot0004",
    "Boot0002"
  ]
}
Redfish shows similar output to ipmitool.
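To compare this against the IPMI view automatically, the Boot object can be checked for an armed PXE override. A minimal sketch, assuming the JSON is already available as a string (for example from the redfishtool call above); the function names are illustrative, and the sample below is abridged from the output above.

```python
import json

# Abridged copy of the Boot object shown above.
SAMPLE = """
{
  "BootSourceOverrideEnabled": "Once",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Pxe",
  "BootNext": null,
  "BootOrder": ["Boot0000", "Boot0003", "Boot0004", "Boot0002"]
}
"""


def pxe_override_armed(boot: dict) -> bool:
    """True if a one-time or continuous override to PXE is set."""
    enabled = boot.get("BootSourceOverrideEnabled")
    target = boot.get("BootSourceOverrideTarget")
    return enabled in ("Once", "Continuous") and target == "Pxe"


def summarize(boot: dict) -> str:
    """One-line summary of override state and the regular boot order."""
    order = boot.get("BootOrder") or []
    first = order[0] if order else "unknown"
    state = "armed" if pxe_override_armed(boot) else "not armed"
    mode = boot.get("BootSourceOverrideMode", "unknown")
    return f"PXE override {state} ({mode}); first BootOrder entry: {first}"


print(summarize(json.loads(SAMPLE)))
# PXE override armed (UEFI); first BootOrder entry: Boot0000
```

Which device Boot0000 refers to is not visible in this output; that would need a lookup under the BootOptions collection.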
Updated by MMoese 3 months ago
Some machines really don't want to have ipmi controlled boot devices.
For the kernel bare-metal hardware, we solved this with a workaround: those machines are set to always boot from PXE first, with their first NVMe as the second boot device. They always network boot and get their boot script from the baremetal support service. When we don't want to install them, we just use https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/7c85092bafb2a6ca9f64d91b871544790da683ce/tests/installation/ipxe_install.pm#L126 and the machine boots from disk. I've mostly observed this on UEFI machines and don't remember encountering it with legacy boot.
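The idea behind this workaround can be sketched as two alternative iPXE boot scripts served to a machine that always network boots first. This is an illustration only: the function names are hypothetical, the linked ipxe_install.pm code is not reproduced here, and it is an assumption that a script consisting of the iPXE exit command makes the firmware continue with the next entry in its boot order (here the first NVMe).

```python
def install_script(host: str, kernel_url: str, initrd_url: str,
                   kernel_args: str = "") -> str:
    """iPXE script that loads an installer kernel and initrd and boots it."""
    lines = [
        "#!ipxe",
        f"echo + Host: {host}",
        f"kernel {kernel_url} {kernel_args}".rstrip(),
        f"initrd {initrd_url}",
        "boot",
    ]
    return "\n".join(lines) + "\n"


def boot_from_disk_script() -> str:
    """iPXE script that leaves iPXE again (assumed to fall through to disk)."""
    return "#!ipxe\nexit\n"


def script_for(host: str, reinstall: bool, kernel_url: str = "",
               initrd_url: str = "", kernel_args: str = "") -> str:
    """Pick the script depending on whether the machine should be installed."""
    if reinstall:
        return install_script(host, kernel_url, initrd_url, kernel_args)
    return boot_from_disk_script()


print(script_for("bare-metal5.qe.prg2.suse.org", reinstall=False), end="")
```

With this approach the boot device never has to be switched via IPMI; only the script served to the machine changes.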
Not sure if this helps you.
Updated by xguo 3 months ago · Edited
Julie_CAO wrote in #note-13:
@xguo Do you have ideas?
No ideas.
I tried adding IPMI_BACKEND_MC_RESET=1, but can confirm that it does not work very well with bare-metal5 either. bare-metal5 is still unstable.
FYI.
Refer to https://openqa.suse.de/admin/workers/3992 for more details.
Updated by Julie_CAO 3 months ago
The machine booted from pxe successfully, https://openqa.suse.de/tests/16262884
Updated by Julie_CAO 3 months ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
The machine is working fine as an IPMI worker now. https://openqa.suse.de/admin/workers/3992
I am closing this ticket. Thanks to everyone who got involved.