action #174448

open

bare-metal5 and bare-metal6 fail to boot from PXE most times

Added by dheidler 4 days ago. Updated 3 minutes ago.

Status: New
Priority: High
Assignee: -
Category: Regressions/Crashes
Start date: 2024-12-16
Due date: 2024-12-31 (Due in 11 days)
% Done: 0%
Estimated time:

Description

Brought up in #174352#note-8

Worker slot in question: https://openqa.suse.de/admin/workers/3992

https://openqa.suse.de/tests/16190397 is a good example.

[2024-12-13T10:57:41.004199Z] [debug] [pid:38779] setting iPXE bootscript on http://baremetal-support.qe.prg2.suse.org for 10.146.4.107 to:
  #!ipxe
  echo ++++++++++++++++++++++++++++++++++++++++++
  echo ++++++++++++ openQA ipxe boot ++++++++++++
  echo +    Host: bare-metal5.qe.prg2.suse.org
  echo ++++++++++++++++++++++++++++++++++++++++++

  kernel http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1/boot/x86_64/loader/linux install=http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1   root=/dev/ram0  initrd=initrd  textmode=1  autoyast=http://worker36.oqa.prg2.suse.org:20623/PnzpD17aUM9vS2E7/files/bare-metal5.qe.prg2.suse.orgvirt_autotest/host_unattended_installation_files/autoyast/dev_host_15.xml sshd=1 sshpassword=nots3cr3t  plymouth.enable=0  video=1024x768 vt.color=0x07  console=ttyS1,115200  Y2DEBUG=1 linuxrc.log=/dev/ttyS1 linuxrc.core=/dev/ttyS1 linuxrc.debug=4,trace  reboot_timeout=0 
  initrd http://openqa.suse.de/assets/repo/fixed/SLE-15-SP6-Online-x86_64-GM-Media1/boot/x86_64/loader/initrd
  boot
  
[2024-12-13T10:57:41.008929Z] [debug] [pid:38779] 200 OK
  
[2024-12-13T10:57:41.009014Z] [debug] [pid:38779] setting boot device to pxe
[2024-12-13T10:57:41.068997Z] [debug] [pid:38779] IPMI: Set Boot Device to pxe
[2024-12-13T10:57:44.131260Z] [debug] [pid:38779] IPMI: Boot parameter version: 1
  Boot parameter 5 is valid/unlocked
  Boot parameter data: a004000000
   Boot Flags :
     - Boot Flag Valid
     - Options apply to only next boot
     - BIOS EFI boot 
     - Boot Device Selector : Force PXE
     - BIOS verbosity : System Default
     - Console Redirection control : Console redirection occurs per BIOS configuration setting (default)
     - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
[2024-12-13T10:57:44.189398Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:47.248980Z] [debug] [pid:38779] IPMI: Chassis Power is off
[2024-12-13T10:57:47.309820Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:50.368759Z] [debug] [pid:38779] IPMI: Chassis Power is off
[2024-12-13T10:57:50.461045Z] [debug] [pid:38779] IPMI: Chassis Power Control: Up/On
[2024-12-13T10:57:53.516036Z] [debug] [pid:38779] IPMI: Chassis Power is on
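
The "Boot parameter data: a004000000" line in the log above can be cross-checked by hand. The following is a minimal sketch, assuming the standard IPMI 2.0 boot flags layout (byte 1: valid / persistent / EFI bits; byte 2: device selector in bits 5:2). The function name and the shortened selector table are illustrative and are not taken from ipmitool or openQA.

```python
# Decode the "Boot parameter data" bytes that ipmitool prints for boot parameter 5.
# Assumption: standard IPMI 2.0 boot flags layout; only common selectors are listed.

BOOT_DEVICES = {
    0x0: "No override",
    0x1: "Force PXE",
    0x2: "Force boot from default hard drive",
    0x5: "Force boot from default CD/DVD",
    0x6: "Force boot into BIOS setup",
}


def decode_boot_flags(hex_data: str) -> dict:
    """Return the flags encoded in the first two bytes of the parameter data."""
    data = bytes.fromhex(hex_data)
    if len(data) < 2:
        raise ValueError("need at least two bytes of boot flag data")
    selector = (data[1] >> 2) & 0x0F  # bits 5:2 of byte 2
    return {
        "valid": bool(data[0] & 0x80),       # bit 7: boot flags valid
        "persistent": bool(data[0] & 0x40),  # bit 6: 0 means next boot only
        "efi": bool(data[0] & 0x20),         # bit 5: 1 means EFI boot type
        "device": BOOT_DEVICES.get(selector, f"selector 0x{selector:x}"),
    }


flags = decode_boot_flags("a004000000")
print(flags)
```

For the value in the log this yields a valid, non-persistent, EFI, "Force PXE" request, which agrees with the decoded flags ipmitool printed. So the BMC appears to have accepted the setting even though the machine did not act on it.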

Frame by frame analysis of the video doesn't indicate any attempt to perform a PXE boot.




Files

frame0114.png (24 KB), dheidler, 2024-12-16 10:56
frame0115.png (2.57 KB), dheidler, 2024-12-16 10:57
frame0148.png (18.5 KB), dheidler, 2024-12-16 10:57

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #174352: 2 ipmi backend baremetal machines in OSD worker pool are offline size:S (Status: Resolved, Assignee: dheidler, Start date: 2024-12-13, Due date: 2024-12-28)

Actions #1

Updated by dheidler 4 days ago

  • Description updated (diff)
Actions #2

Updated by dheidler 4 days ago

  • Related to action #174352: 2 ipmi backend baremetal machines in OSD worker pool are offline size:S added

Updated by dheidler 4 days ago

Actions #4

Updated by livdywan 4 days ago

  • Description updated (diff)
  • Target version set to Ready

I think this should also go on the backlog as the workers are online but still basically unable to run tests.

Actions #5

Updated by Julie_CAO 4 days ago · Edited

Hi @xlai, is there a USB stick on this machine? Or do you know of any change that has been made to this machine recently? They used to run tests well but have now begun to fail to boot.

CC @xguo as you are quite familiar with the two machines.

Actions #6

Updated by dheidler 4 days ago · Edited

  • Status changed from New to In Progress
  • Assignee set to dheidler
  • Took bare-metal5 out of production
  • PXE or BIOS bootdev selection via ipmitool for the next boot is NOT followed (at least when using "chassis bootdev pxe", "chassis power off", "chassis power on", despite "chassis bootparam get 5" showing otherwise)
  • disabled quiet boot via bios setup
  • still no change
  • deleted sles boot entry and put network boot to first in order
  • reenabled
  • let's see https://openqa.suse.de/tests/16203423#live
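
The ipmitool sequence from the list above can be written down as a small dry-runnable sketch. The subcommands are the ones named in the note. The `lanplus` interface, the placeholder credentials, and the helper names are assumptions for illustration; the BMC hostname is the one used with redfishtool elsewhere in this ticket.

```python
import subprocess
from typing import Callable, List

# Assumption: the BMC is reached over the lanplus interface; the ticket does not say.
BMC_HOST = "bare-metal5-ipmi.qe.prg2.suse.org"


def ipmi_argv(host: str, user: str, password: str, *command: str) -> List[str]:
    """Build an ipmitool command line without running it."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *command]


def pxe_boot_sequence(host: str, user: str, password: str) -> List[List[str]]:
    """Request PXE, read back boot parameter 5, then power-cycle the chassis."""
    steps = [
        ("chassis", "bootdev", "pxe"),
        ("chassis", "bootparam", "get", "5"),
        ("chassis", "power", "off"),
        ("chassis", "power", "on"),
    ]
    return [ipmi_argv(host, user, password, *step) for step in steps]


def run_sequence(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command; the runner is injectable so the sequence can be dry-run."""
    for argv in commands:
        runner(argv, check=True)


commands = pxe_boot_sequence(BMC_HOST, "USER", "PASSWORD")
for argv in commands:
    # Print with credentials masked instead of executing anything.
    print(" ".join(argv[:5] + ["-U", "*****", "-P", "*****"] + argv[9:]))
```

Calling `run_sequence(commands)` would execute the steps against the real BMC. Passing a different `runner` allows the order to be checked without touching the machine.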
Actions #7

Updated by xlai 3 days ago

Julie_CAO wrote in #note-5:

Hi @xlai, is there a USB stick on this machine? Or do you know of any change that has been made to this machine recently? They used to run tests well but have now begun to fail to boot.

No, the USB sticks are on bare-metal{1,2}. As far as I know, nothing has changed on the two machines.

Actions #8

Updated by Julie_CAO 3 days ago

dheidler wrote in #note-6:

  • Took bare-metal5 out of production
  • PXE or BIOS bootdev selection via ipmitool for the next boot is NOT followed (at least when using "chassis bootdev pxe", "chassis power off", "chassis power on", despite "chassis bootparam get 5" showing otherwise)
  • disabled quiet boot via bios setup
  • still no change
  • deleted sles boot entry and put network boot to first in order
  • reenabled
  • let's see https://openqa.suse.de/tests/16203423#live

Thanks for these actions, but it still failed to boot in https://openqa.suse.de/tests/16203423

Actions #9

Updated by openqa_review 3 days ago

  • Due date set to 2024-12-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by dheidler 3 days ago

redfishtool -r bare-metal5-ipmi.qe.prg2.suse.org -u ***** -p ***** Systems get | jq .Boot
{
  "BootSourceOverrideEnabled": "Once",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Pxe",
  "BootSourceOverrideTarget@Redfish.AllowableValues": [
    "None",
    "Pxe",
    "Floppy",
    "Cd",
    "Usb",
    "Hdd",
    "BiosSetup",
    "UsbCd",
    "UefiBootNext",
    "UefiHttp"
  ],
  "BootOptions": {
    "@odata.id": "/redfish/v1/Systems/1/BootOptions"
  },
  "BootNext": null,
  "BootOrder": [
    "Boot0000",
    "Boot0003",
    "Boot0004",
    "Boot0002"
  ]
}

Redfish shows output similar to what ipmitool reports.
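
To make the comparison with the IPMI boot flags explicit, here is a minimal sketch that summarises a Redfish `Boot` object such as the one above. It relies only on the standard `BootSourceOverrideEnabled` values (`Disabled`, `Once`, `Continuous`). The function name is illustrative, and the JSON is a trimmed copy of the output in this note.

```python
import json


def boot_override_summary(boot: dict) -> str:
    """Describe what the BMC claims will happen on the next boot."""
    enabled = boot.get("BootSourceOverrideEnabled", "Disabled")
    target = boot.get("BootSourceOverrideTarget", "None")
    mode = boot.get("BootSourceOverrideMode", "unknown")
    if enabled == "Disabled" or target == "None":
        first = (boot.get("BootOrder") or ["unknown"])[0]
        return f"no override; firmware follows BootOrder starting at {first}"
    scope = "next boot only" if enabled == "Once" else "every boot"
    return f"override to {target} ({mode}) for {scope}"


# Trimmed copy of the `jq .Boot` output above.
boot = json.loads("""
{
  "BootSourceOverrideEnabled": "Once",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Pxe",
  "BootNext": null,
  "BootOrder": ["Boot0000", "Boot0003", "Boot0004", "Boot0002"]
}
""")
summary = boot_override_summary(boot)
print(summary)
```

Both interfaces therefore report a one-time UEFI PXE override as armed, which suggests the request reaches the BMC and is lost somewhere between the BMC and the firmware's boot selection.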

Actions #11

Updated by dheidler 3 days ago

Updated BIOS from 2.3a to 2.4 and BMC from 01.02.09 to 01.03.05 on bare-metal5

Actions #12

Updated by dheidler about 23 hours ago

  • Status changed from In Progress to Workable
  • Assignee deleted (dheidler)

I'm out of ideas here.

Actions #13

Updated by Julie_CAO about 23 hours ago

@xguo Do you have ideas?

@xlai shall we pull them out of the OSD worker pool until the ticket is resolved?

Actions #14

Updated by xlai about 22 hours ago

Has anyone tried whether selecting PXE boot from the BIOS works? Besides, I suggest checking the boot options in the BIOS and allowing only PXE boot and disk boot.

@xlai shall we pull them out of the OSD worker pool until the ticket is resolved?

Yes, please do.

Actions #15

Updated by MMoese about 21 hours ago

Some machines really don't want to have ipmi controlled boot devices.

With kernel baremetal hardware, we solved this with a workaround. We have set those of our machines to always boot from PXE first, with their first NVMe as the second boot device. They always boot and get their bootscript from the baremetal support service. When we don't want to install them, we just use https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/7c85092bafb2a6ca9f64d91b871544790da683ce/tests/installation/ipxe_install.pm#L126 and the machine boots from disk. I've mostly observed this for UEFI machines and don't remember encountering it for legacy boot.
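
The workaround described above can be sketched as two bootscripts served to a machine that always PXE-boots first. This is a hedged illustration, not the code behind the link: the helper names are hypothetical, and using iPXE's `exit` command to hand control back to the firmware (so that it falls through to the next boot entry, here the NVMe) is an assumption about how the disk-boot case works.

```python
def install_bootscript(kernel_url: str, initrd_url: str, kernel_args: str) -> str:
    """iPXE script that loads an installer kernel and initrd, then boots it."""
    return "\n".join([
        "#!ipxe",
        f"kernel {kernel_url} {kernel_args}",
        f"initrd {initrd_url}",
        "boot",
        "",
    ])


def disk_bootscript() -> str:
    """iPXE script that leaves iPXE so the firmware tries its next boot entry."""
    return "#!ipxe\nexit\n"


# Placeholder URLs; a real script would point at the installation repository.
script = install_bootscript(
    "http://example.invalid/boot/x86_64/loader/linux",
    "http://example.invalid/boot/x86_64/loader/initrd",
    "initrd=initrd console=ttyS1,115200",
)
print(script)
print(disk_bootscript())
```

With this scheme the boot device never has to be switched via IPMI. Only the script served for the machine changes between "install" and "boot from disk".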

Not sure if this helps you.

Actions #16

Updated by xguo about 19 hours ago · Edited

Julie_CAO wrote in #note-13:

@xguo Do you have ideas?

No ideas.

I tried adding IPMI_BACKEND_MC_RESET=1, but I can confirm that it does not work very well with bare-metal5 either. bare-metal5 is still unstable.

FYI.
Refer to https://openqa.suse.de/admin/workers/3992 for more details.

Actions #17

Updated by okurz 3 minutes ago

  • Status changed from Workable to New