action #166394

closed

iPXE service is unavailable for baremetal SUTs on OSD worker pool size:S

Added by Julie_CAO 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Support
Start date: 2024-09-05
Due date:
% Done: 100%
Estimated time:
Tags:

Description

I ran a few tests on IPMI bare-metal SUTs, but all of them failed at the iPXE connection step. Should I open an SD ticket, or can the tools team handle it?

https://openqa.suse.de/tests/15345958/video?filename=video.webm

 PXE-E18: Server response timeout 

pxe_1

https://openqa.suse.de/tests/15348009/video?filename=video.webm

 PXE-E61: Media test failure, check cable 
 PXE-M0F: Exit Intel boot agent 

pxe_2


Files

2.png (43.8 KB) 2.png Julie_CAO, 2024-09-05 08:47
1.png (48.2 KB) 1.png Julie_CAO, 2024-09-05 08:50
3.png (54.2 KB) 3.png Julie_CAO, 2024-09-05 12:29
Actions #1

Updated by xlai 3 months ago

  • Status changed from New to Workable
  • Priority changed from Normal to Immediate
Actions #2

Updated by tinita 3 months ago · Edited

  • Target version set to Ready

I'm not sure Immediate is the right priority for this.
This is for when there is some serious issue in the infrastructure or software that affects/blocks many things/people.
Can you list the impact? thanks.

Actions #3

Updated by nicksinger 3 months ago

So the first machine in question is "bare-metal2.oqa.prg2.suse.org" (https://racktables.suse.de/index.php?page=object&tab=default&object_id=23403) which is registered to OSD as worker33:17 (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2132-2146). This can be confirmed by the mac visible in the screenshots and also seen in https://openqa.suse.de/tests/15345958. 10.136.53.55 seems unknown to me so I checked what 10.145.10.155 actually is and found: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org#L202 - this is a dynamic IP configuration which will use the default of the network visible in: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/init.sls#L3 which explains why it contacted the wrong server (it is supposed to get its ipxe binary from the "bare metal support server" https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L59-64). So clearly the machine is using the wrong interface to attempt booting (eth1 in racktables, the configured port is eth0).

The second machine is gonzo (https://racktables.suse.de/index.php?page=object&tab=default&object_id=10104) which is using the correct interface but apparently has no cable connected. We can check this one.

Actions #4

Updated by nicksinger 3 months ago

nicksinger wrote in #note-3:

The second machine is gonzo (https://racktables.suse.de/index.php?page=object&tab=default&object_id=10104) which is using the correct interface but apparently has no cable connected. We can check this one.

Wait, that is the second interface… the real reason it couldn't boot is: PXE-E11 ARP timeout

Actions #5

Updated by tinita 3 months ago

  • Status changed from Workable to New
  • Priority changed from Immediate to High
Actions #6

Updated by waynechen55 3 months ago

I think I have the same issue.

Actions #7

Updated by livdywan 3 months ago

  • Tags set to infra
  • Description updated (diff)
Actions #8

Updated by livdywan 3 months ago

  • Subject changed from iPXE service is unavailable for baremetal SUTs on OSD worker pool to iPXE service is unavailable for baremetal SUTs on OSD worker pool size:S
  • Description updated (diff)
  • Category set to Support
Actions #9

Updated by Julie_CAO 3 months ago

Thank you Nick for taking a look.

I'll check the bare-metal2 machine.

Another failure with PXE-E11: ARP timeout, this time on amd-zen3-gpu-sut1: https://openqa.suse.de/tests/15338865/video?filename=video.webm

Actions #10

Updated by Julie_CAO 3 months ago

nicksinger wrote in #note-3:

So the first machine in question is "bare-metal2.oqa.prg2.suse.org" (https://racktables.suse.de/index.php?page=object&tab=default&object_id=23403) which is registered to OSD as worker33:17 (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2132-2146). This can be confirmed by the mac visible in the screenshots and also seen in https://openqa.suse.de/tests/15345958. 10.136.53.55 seems unknown to me so I checked what 10.145.10.155 actually is and found: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org#L202 - this is a dynamic IP configuration which will use the default of the network visible in: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/init.sls#L3 which explains why it contacted the wrong server (it is supposed to get its ipxe binary from the "bare metal support server" https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L59-64). So clearly the machine is using the wrong interface to attempt booting (eth1 in racktables, the configured port is eth0).

Sorry, my screenshot misled you. bare-metal2 first failed its iPXE request on eth0 and then retried on eth1, and I pasted the eth1 screenshot. Sorry for wasting your time. Here is the eth0 screenshot: PXE-E18: Server response timeout

3

Actions #11

Updated by nicksinger 3 months ago

So I think I recovered gonzo and it boots from PXE again. The reason was a wrong boot order. You did some changes on bare-metal2 as well. Was this also some debugging done?

Actions #12

Updated by Julie_CAO 3 months ago · Edited

nicksinger wrote in #note-11:

So I think I recovered gonzo and it boots from PXE again. The reason was a wrong boot order. You did some changes on bare-metal2 as well. Was this also some debugging done?

I did not make any changes to bare-metal2; I just copied the correct screenshot from the job link I pasted in the ticket description. It still fails with "PXE-E18: Server response timeout".

Has gonzo been fixed? I am going to run a test on it right now. thank you.

Actions #13

Updated by Julie_CAO 3 months ago · Edited

Yes, gonzo is working well now, and amd-zen3-gpu-sut1 turns out to be fine as well.

I guess the other bare-metal machines have recovered from this iPXE issue too. It must have been some network or service problem a few hours ago.

Actions #15

Updated by Julie_CAO 3 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

All of the roughly 12 bare-metal machines used for SLE15SP7 tests can boot from iPXE correctly. I made changes to:

gonzo:

  • boot order: set "boot from hard disk" as the first entry, because the automated tests set a ONE-TIME boot from PXE before installation; after installation, the machine can boot from the hard disk normally.
  • disabled eth0 for booting, as eth1 is the interface configured to get a static IP from the DHCP server.
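For reference, a one-time PXE boot like the one mentioned above can be requested over IPMI with `ipmitool chassis bootdev pxe`, which applies to the next boot only unless `options=persistent` is added. The sketch below only builds such a command line. The BMC host name, user name and password file path are placeholders and not values from this ticket, and this is not necessarily how the openQA tests issue the request.

```python
import shlex


def one_time_pxe_command(bmc_host, user, password_file):
    """Build an ipmitool command that requests PXE for the next boot only.

    All three arguments are hypothetical placeholders for illustration.
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,
        "-U", user,
        "-f", password_file,  # read the password from a file instead of argv
        # No "options=persistent": the override applies to the next boot only,
        # so the BIOS boot order (hard disk first) is used afterwards.
        "chassis", "bootdev", "pxe",
    ]


cmd = one_time_pxe_command("bmc.example.org", "admin", "/path/to/ipmi_password")
print(shlex.join(cmd))
```

Because the override is not persistent, keeping "boot from hard disk" first in the BIOS means the machine boots the installed system on every later reboot.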

Thank you for looking into this. Since the problem is gone, I am marking the ticket as done.

Actions #16

Updated by livdywan 3 months ago

  • Assignee set to nicksinger