action #166394
closed
iPXE service is unavailable for baremetal SUTs on OSD worker pool size:S
Added by Julie_CAO 4 months ago. Updated 3 months ago.
100%
Description
I ran a few tests on IPMI baremetal SUTs, but all of them failed at the iPXE connection. Should I open an SD ticket, or can the tools team handle it?
https://openqa.suse.de/tests/15345958/video?filename=video.webm
PXE-E18: Server response timeout
https://openqa.suse.de/tests/15348009/video?filename=video.webm
PXE-E61: Media test failure, check cable
PXE-M0F: Exit Intel boot agent
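When several jobs fail like this, it can help to split the firmware messages into code and text before comparing machines. The following is only a minimal sketch in Python, assuming the messages were transcribed by hand from the job videos. `parse_pxe_message` and `PxeMessage` are hypothetical names, not part of openQA, and treating `E` codes as errors and `M` codes as informational messages is an assumption about the Intel boot agent naming.

```python
import re
from typing import NamedTuple, Optional


class PxeMessage(NamedTuple):
    code: str  # e.g. "PXE-E18"
    kind: str  # "error" for E codes, "message" for M codes (assumed convention)
    text: str  # human-readable part, e.g. "Server response timeout"


# The colon is optional because the messages are sometimes written without it.
_PXE_RE = re.compile(r"^(PXE-([EM])[0-9A-F]{2}):?\s+(.*)$")


def parse_pxe_message(line: str) -> Optional[PxeMessage]:
    """Split one transcribed PXE line into code, kind and text, or return None."""
    match = _PXE_RE.match(line.strip())
    if match is None:
        return None
    code, letter, text = match.groups()
    kind = "error" if letter == "E" else "message"
    return PxeMessage(code, kind, text.strip())


if __name__ == "__main__":
    for line in (
        "PXE-E18: Server response timeout",
        "PXE-E61: Media test failure, check cable",
        "PXE-M0F: Exit Intel boot agent",
    ):
        print(parse_pxe_message(line))
```

Grouping the parsed codes per machine makes it easier to see whether a failure points at the server side (timeouts) or at the local link (media test failure).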
Updated by nicksinger 4 months ago
So the first machine in question is "bare-metal2.oqa.prg2.suse.org" (https://racktables.suse.de/index.php?page=object&tab=default&object_id=23403) which is registered to OSD as worker33:17 (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2132-2146). This can be confirmed by the mac visible in the screenshots and also seen in https://openqa.suse.de/tests/15345958. 10.136.53.55 seems unknown to me so I checked what 10.145.10.155 actually is and found: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org#L202 - this is a dynamic IP configuration which will use the default of the network visible in: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/init.sls#L3 which explains why it contacted the wrong server (it is supposed to get its ipxe binary from the "bare metal support server" https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L59-64). So clearly the machine is using the wrong interface to attempt booting (eth1 in racktables, the configured port is eth0).
The second machine is gonzo (https://racktables.suse.de/index.php?page=object&tab=default&object_id=10104) which is using the correct interface but apparently has no cable connected. We can check this one.
Updated by nicksinger 4 months ago
nicksinger wrote in #note-3:
The second machine is gonzo (https://racktables.suse.de/index.php?page=object&tab=default&object_id=10104) which is using the correct interface but apparently has no cable connected. We can check this one.
Wait, that is the second interface… the real reason it couldn't boot is: PXE-E11 ARP timeout
Updated by Julie_CAO 3 months ago
Thank you, Nick, for taking a look.
I'll check the bare-metal2 machine.
Another failure with PXE-E11: ARP timeout occurred on amd-zen3-gpu-sut1: https://openqa.suse.de/tests/15338865/video?filename=video.webm
Updated by Julie_CAO 3 months ago
nicksinger wrote in #note-3:
So the first machine in question is "bare-metal2.oqa.prg2.suse.org" (https://racktables.suse.de/index.php?page=object&tab=default&object_id=23403) which is registered to OSD as worker33:17 (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls?ref_type=heads#L2132-2146). This can be confirmed by the mac visible in the screenshots and also seen in https://openqa.suse.de/tests/15345958. 10.136.53.55 seems unknown to me so I checked what 10.145.10.155 actually is and found: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/salt/profile/dns/files/prg2_suse_org/dns-oqa.prg2.suse.org#L202 - this is a dynamic IP configuration which will use the default of the network visible in: https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/init.sls#L3 which explains why it contacted the wrong server (it is supposed to get its ipxe binary from the "bare metal support server" https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L59-64). So clearly the machine is using the wrong interface to attempt booting (eth1 in racktables, the configured port is eth0).
Sorry, my screenshot misled you. bare-metal2 actually sent its first iPXE request on eth0, failed, and then retried on eth1. I pasted the screenshot of the eth1 attempt; sorry that it wasted your time. Here is the screenshot of the eth0 attempt: PXE-E18: Server response timeout
Updated by nicksinger 3 months ago
So I think I recovered gonzo and it boots from PXE again. The reason was a wrong boot order. You made some changes on bare-metal2 as well. Was that also part of some debugging?
Updated by Julie_CAO 3 months ago · Edited
nicksinger wrote in #note-11:
So I think I recovered gonzo and it boots from PXE again. The reason was a wrong boot order. You made some changes on bare-metal2 as well. Was that also part of some debugging?
I did not make any changes to bare-metal2. I just copied the correct screenshot from the job link I pasted in the ticket description. It still fails with "PXE-E18: Server response timeout".
Has gonzo been fixed? I am going to run a test on it right now. Thank you.
Updated by Julie_CAO 3 months ago
I triggered a few more tests and will check them tomorrow. Thank you for the fix.
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP7&build=10.7&groupid=263
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP7&build=10.7&groupid=264
Updated by Julie_CAO 3 months ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
All of the roughly 12 baremetal machines used for SLE15SP7 tests can now boot from iPXE correctly. I made the following changes to gonzo:
- boot order: set "boot from hard disk" as the first entry, because the automated tests set a ONE-TIME boot from PXE before installation; after installation, the machine can boot from hard disk normally.
- disabled eth0 as a boot device, as eth1 is configured to get a static IP from the DHCP server.
Thank you for looking into this. Since the problem is gone, I am marking the ticket as done.
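For reference, the one-time PXE boot mentioned for gonzo could be requested over IPMI roughly as follows. This is a sketch under assumptions, not the way the tests necessarily do it: the ticket does not show how the automation sets the boot device. The BMC host name, user, and password file below are hypothetical placeholders. The snippet only builds and prints the `ipmitool` command; `chassis bootdev pxe` without `options=persistent` applies to the next boot only, so the hard-disk-first boot order stays in effect afterwards.

```python
import shlex
from typing import List


def one_time_pxe_boot_cmd(bmc_host: str, user: str, password_file: str) -> List[str]:
    """Build an ipmitool call that requests PXE for the next boot only."""
    return [
        "ipmitool",
        "-I", "lanplus",        # IPMI v2.0 over LAN
        "-H", bmc_host,         # BMC address (hypothetical value below)
        "-U", user,
        "-f", password_file,    # read the password from a file, not the command line
        "chassis", "bootdev", "pxe",  # no options=persistent, so one-time only
    ]


if __name__ == "__main__":
    cmd = one_time_pxe_boot_cmd(
        bmc_host="gonzo-bmc.example.org",  # placeholder, not the real BMC name
        user="admin",                      # placeholder
        password_file="/path/to/ipmi-password",  # placeholder
    )
    # Print instead of executing, so nothing is changed on a real machine.
    print(shlex.join(cmd))
```

Keeping the persistent order at "hard disk first" and overriding it per job means a machine that is rebooted outside a test run does not fall back into the PXE path.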