action #163529
closed - redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S
Description
Observation
pvm_hmc workers have lost OS / grub access, so no tests are able to be run at the moment.
Looking at the same job scenario, the failure appeared between June 17th (https://openqa.suse.de/tests/14640503#step/bootloader/21) and July 2nd (https://openqa.suse.de/tests/14860440#step/bootloader/15). This can be further narrowed down to after June 25th: https://openqa.suse.de/tests/14734966
Suggestions
- DONE Check if other redcurrant machines also fail to boot from network
- Happens on redcurrant-4 as well: https://openqa.suse.de/tests/14788465#step/bootloader_start/17
- Why is https://openqa.suse.de/tests/14860440#step/bootloader/21 showing 10.145.10.21 as the server IP? It should be 10.168.192.10, as set by "dhcp_next_server" in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L441
- Check firewall logs following https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones
Updated by nicksinger 6 months ago
- Subject changed from ppc64le pvm_hmc backend workers lost OS / grub access to redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
- Description updated (diff)
Updated by nicksinger 6 months ago
- Subject changed from redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
- Status changed from New to In Progress
- Assignee set to nicksinger
The first job I found was https://openqa.suse.de/tests/14779007#step/bootloader_start/20 from 1st of July. I will check which server IP is configured in petitboot on the LPARs. Not sure what caused it to change. The new/wrong IP points to "bare-metal4-ipmi" (https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L85-88) which might be misbehaving in the network.
Updated by nicksinger 6 months ago
So indeed petitboot contained the wrong IP. But even after changing it manually, it fails to chainload the network grub:
Interpartition Logical LAN: U9008.22L.788201A-V3-C3-T1
1. Client IP Address [10.145.10.222]
2. Server IP Address [10.168.192.10]
3. Gateway IP Address [10.145.10.254]
4. Subnet Mask [255.255.255.0]
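As a sanity check on the addresses above: the server sits outside the client's /24, so every TFTP reply has to cross the gateway (and whatever filters sit between the zones). A quick sketch with Python's stdlib ipaddress module, using the values from the dialog above:

```python
import ipaddress

# Values as configured in the SMS/petitboot dialog above
client = ipaddress.ip_address("10.145.10.222")
server = ipaddress.ip_address("10.168.192.10")
gateway = ipaddress.ip_address("10.145.10.254")
subnet = ipaddress.ip_network("10.145.10.0/24")  # from mask 255.255.255.0

print(client in subnet)   # True  - client is in the local subnet
print(gateway in subnet)  # True  - gateway must be local to be usable
print(server in subnet)   # False - server is not local, so traffic is routed
```

So the configuration itself is plausible; any failure must come from routing or filtering between the two subnets.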
redcurrant-1 can reach the PXE server:
10.145.10.222: 24 bytes from 10.168.192.10: icmp_seq=10 ttl=? time=21 ms
.-----------------.
| Ping Success. |
`-----------------'
Press any key to continue..........
but fails to load it:
BOOTP Parameters:
----------------
chosen-network-type = ethernet,auto,rj45,auto
server IP = 10.168.192.10
client IP = 10.145.10.222
gateway IP = 10.145.10.254
device = /vdevice/l-lan@30000003
MAC address = f6 6b 46 d3 fd 03
loc-code = U9008.22L.788201A-V3-C3-T1
BOOTP request retry attempt: 1
BOOTP request retry attempt: 2
BOOTP request retry attempt: 3
BOOTP request retry attempt: 4
!BA01B015 !
Cross-checking with worker31 to figure out whether the DHCP server behaves correctly. It looks like the "dhcp_filename" never shows up (I think I already saw that in petitboot in the past).
Updated by nicksinger 6 months ago
I can see the corresponding DHCP answers and can also fetch "ppc64le/grub2" manually with tftp from "10.168.192.10" (qa-jump) via worker31, so something else is off with Power. Checking whether I can tftp files from within a petitboot shell.
Updated by nicksinger 6 months ago
With the LPAR booted up in some SUT environment I was able to log in and test from there:
susetest:~ # tftp
tftp> connect 10.168.192.10
tftp> get ppc64le/grub2
Transfer timed out.
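The read request that times out here is a tiny UDP datagram. A sketch of how such an RRQ is laid out per RFC 1350 (filename and mode mirror the session above; the helper name is made up):

```python
import struct

def build_rrq(filename: str, mode: str = "netascii") -> bytes:
    """Build a TFTP read request (opcode 1) as defined in RFC 1350."""
    opcode = struct.pack("!H", 1)  # 2-byte big-endian opcode, 1 = RRQ
    return opcode + filename.encode("ascii") + b"\x00" + mode.encode("ascii") + b"\x00"

pkt = build_rrq("ppc64le/grub2")
print(len(pkt))  # -> 25 (2 opcode + 13 filename + NUL + 8 mode + NUL)
```

The 25-byte length matches the "TFTP, length 25" RRQ visible in the tcpdump capture later in this ticket, i.e. the request itself arrives fine; it is the server's DATA reply that never makes it back.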
The server on the other side shows the following in the logs of tftpd.service:
Jul 09 11:55:28 qa-jump systemd[1]: Started Tftp Server.
Jul 09 11:55:28 qa-jump in.tftpd[5799]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:28 qa-jump in.tftpd[5799]: tftpd: read(ack): No route to host
Jul 09 11:55:33 qa-jump in.tftpd[5804]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:33 qa-jump in.tftpd[5804]: tftpd: read(ack): No route to host
Jul 09 11:55:38 qa-jump in.tftpd[5805]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:38 qa-jump in.tftpd[5805]: tftpd: read(ack): No route to host
Jul 09 11:55:43 qa-jump in.tftpd[5806]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:43 qa-jump in.tftpd[5806]: tftpd: read(ack): No route to host
Jul 09 11:55:48 qa-jump in.tftpd[5856]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:48 qa-jump in.tftpd[5856]: tftpd: read(ack): No route to host
A ping from qa-jump to the LPAR (10.145.10.222) works:
qa-jump:~ # ping 10.145.10.222
PING 10.145.10.222 (10.145.10.222) 56(84) bytes of data.
64 bytes from 10.145.10.222: icmp_seq=1 ttl=62 time=4.71 ms
64 bytes from 10.145.10.222: icmp_seq=2 ttl=62 time=4.73 ms
^C
--- 10.145.10.222 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 4.708/4.718/4.729/0.010 ms
So I'm not sure where this "No route to host" error comes from.
Updated by livdywan 6 months ago
- Subject changed from redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S
- Description updated (diff)
Updated by nicksinger 6 months ago
Found the following on qa-jump:
qa-jump:~ # tcpdump host 10.145.10.222
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:58:37.920253 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:37.922344 IP qa-jump.qe.nue2.suse.org.35379 > 10.145.10.222.35954: UDP, length 516
11:58:37.927188 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:42.920297 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:42.922035 IP qa-jump.qe.nue2.suse.org.43770 > 10.145.10.222.35954: UDP, length 516
11:58:42.926925 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:47.920210 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:47.922127 IP qa-jump.qe.nue2.suse.org.47782 > 10.145.10.222.35954: UDP, length 516
11:58:47.927022 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:52.920088 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:52.921962 IP qa-jump.qe.nue2.suse.org.57527 > 10.145.10.222.35954: UDP, length 516
11:58:52.926861 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:57.919834 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:57.922274 IP qa-jump.qe.nue2.suse.org.60022 > 10.145.10.222.35954: UDP, length 516
11:58:57.927131 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
I will check whether I can spot something on the firewall with the instructions provided in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open an SD ticket if necessary.
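Worth noting: tcpdump's "admin prohibited filter" corresponds to an ICMP destination-unreachable message (type 3) with code 13, which is typically what a packet filter's REJECT rule emits. A tiny lookup sketch (codes per RFC 792/1812; the helper is illustrative):

```python
# ICMP type 3 (destination unreachable) codes relevant here (RFC 792/1812)
UNREACH_CODES = {
    0: "net unreachable",
    1: "host unreachable",
    3: "port unreachable",
    13: "communication administratively prohibited",
}

def explain(icmp_type: int, icmp_code: int) -> str:
    """Map an ICMP type/code pair to a human-readable unreachable reason."""
    if icmp_type != 3:
        return "not a destination-unreachable message"
    return UNREACH_CODES.get(icmp_code, "other unreachable code")

# The capture shows 10.145.10.222 itself sending these replies,
# which points at a filter on or near the client rather than the server.
print(explain(3, 13))  # -> communication administratively prohibited
```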
Updated by openqa_review 6 months ago
- Due date set to 2024-07-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 5 months ago
- Status changed from In Progress to Feedback
nicksinger wrote in #note-9:
I will check whether I can spot something on the firewall with the instructions provided in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open an SD ticket if necessary.
Couldn't find anything relevant in the linked firewall log, but I'm not even sure which firewall it shows. I collected a lot more details and filed https://sd.suse.com/servicedesk/customer/portal/1/SD-162395
Basically I can currently say:
- .qe.prg2.suse.org (10.145.10.0/24) is affected
- worker31 (10.145.10.4) is not affected
- .oqa.prg2.suse.org (10.145.0.0/21) is affected - this is mainly our PPC network range
- IPv4 and IPv6 behave mostly the same
- Could be related to the location PRG2-J11 but having worker31 unaffected seems to rule it out
Updated by nicksinger 5 months ago
- Status changed from Blocked to In Progress
While collecting more evidence and traces for https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 I realized that the firewall was to blame on the other hosts I tried. Also on the SUT running on redcurrant-1 a firewall was active. With that disabled even redcurrant-1 was able to connect and download from qa-jump. This unfortunately brings me back to square one - that this problem only occurs from within the SMS. Next I will try out more LPARs on grenache to understand if this is happening on a machine level or only for specific LPARs. On machine level could point to an HMC or VIOS problem. Only on specific LPARs could point to some strange "corruption" of some persistent storage (we had something similar in the past already).
Updated by nicksinger 5 months ago
- Status changed from In Progress to Blocked
Made some progress while checking other redcurrant LPARs. I discovered that in every working case I had reached the TFTP server directly (port 69), while requests from the SMS menu of PPC machines fail on port 67. So DHCP is already not working (I guess it is required for the filename? Everything else is hard-configured in the SMS menu).
I checked suttner1&2 and indeed see problems there. I added this to https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 and asked for help again.
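For context, what the SMS firmware sends on port 67 is a plain BOOTP/DHCP request. A rough, illustrative sketch of the fixed BOOTP layout from RFC 2131 (the helper is hypothetical; only the MAC is taken from the BOOTP log earlier in this ticket):

```python
import struct

def build_dhcp_discover(mac: bytes, xid: int = 0x12345678) -> bytes:
    """Minimal DHCPDISCOVER: BOOTP fixed header + magic cookie + option 53."""
    assert len(mac) == 6
    pkt = struct.pack(
        "!BBBBIHH4s4s4s4s",
        1,            # op: BOOTREQUEST
        1,            # htype: ethernet
        6,            # hlen
        0,            # hops
        xid,          # transaction id
        0,            # secs
        0x8000,       # flags: broadcast
        b"\x00" * 4,  # ciaddr
        b"\x00" * 4,  # yiaddr
        b"\x00" * 4,  # siaddr (filled by the server, cf. dhcp_next_server)
        b"\x00" * 4,  # giaddr
    )
    pkt += mac + b"\x00" * 10             # chaddr (padded to 16 bytes)
    pkt += b"\x00" * 64 + b"\x00" * 128   # sname, file (the "dhcp_filename" slot)
    pkt += b"\x63\x82\x53\x63"            # DHCP magic cookie
    pkt += bytes([53, 1, 1, 255])         # option 53 = DHCPDISCOVER, then end
    return pkt

pkt = build_dhcp_discover(bytes.fromhex("f66b46d3fd03"))  # MAC from the BOOTP log
print(len(pkt))  # -> 244
```

The point is only that the firmware needs a DHCP round-trip to learn the boot filename; if port 67 is filtered, the SMS network boot fails even though TFTP itself works.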
Updated by nicksinger 5 months ago
@okurz had a chance to speak with Gerhard, who applied the same salt-state on both machines. No change in behavior though, and redcurrant-1 still cannot fetch its PXE binary.
Updated by nicksinger 5 months ago
Quick ping in the ticket again. Situation unchanged.
Updated by okurz 5 months ago
- Due date deleted (2024-07-24)
- Priority changed from High to Normal
No quick resolution within https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 is expected.
Updated by nicksinger 4 months ago
- Status changed from Blocked to In Progress
Both suttners should be fixed again. Going to check if anything changed with this machine.
Updated by openqa_review 4 months ago
- Due date set to 2024-09-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 4 months ago
I was not able to power on the machine via the HMC after both suttners had been fixed, so I could not verify the fix from https://sd.suse.com/servicedesk/customer/portal/1/SD-162395
Updated by nicksinger 3 months ago
- Status changed from Workable to In Progress
The situation with the machine got worse with every interaction I tried. First the LPAR didn't start (timeout in the HMC web dialogue). A reset of the ASM connection didn't change anything. Rebooting the VIOS didn't work because the HMC timed out again. After receiving a report in Slack that the machine was now offline, I tried to restart the whole hypervisor, which again resulted in timeouts while trying to start the VIOS. I then tried to restart the ASM, which now results in a completely unstable ASM connection:
From 10.255.255.1 icmp_seq=6742 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6759 ttl=64 time=0.263 ms
64 bytes from 10.255.255.153: icmp_seq=6774 ttl=64 time=0.387 ms
64 bytes from 10.255.255.153: icmp_seq=6789 ttl=64 time=0.238 ms
64 bytes from 10.255.255.153: icmp_seq=6790 ttl=64 time=0.282 ms
64 bytes from 10.255.255.153: icmp_seq=6791 ttl=64 time=0.256 ms
64 bytes from 10.255.255.153: icmp_seq=6805 ttl=64 time=0.238 ms
64 bytes from 10.255.255.153: icmp_seq=6806 ttl=64 time=0.284 ms
From 10.255.255.1 icmp_seq=6818 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6819 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6820 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6821 ttl=64 time=1024 ms
64 bytes from 10.255.255.153: icmp_seq=6822 ttl=64 time=0.363 ms
64 bytes from 10.255.255.153: icmp_seq=6836 ttl=64 time=0.284 ms
64 bytes from 10.255.255.153: icmp_seq=6837 ttl=64 time=0.300 ms
From 10.255.255.1 icmp_seq=6849 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6850 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6851 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6854 ttl=64 time=0.238 ms
From 10.255.255.1 icmp_seq=6880 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6881 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6882 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6899 ttl=64 time=0.326 ms
From 10.255.255.1 icmp_seq=6910 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6911 Destination Host Unreachable
According to racktables the machine is connected to PDU-A-J11 and PDU-B-J11, which (again according to racktables) are both in the .mgmt.prg2.suse.org domain and both unreachable for me. I am considering opening an SD ticket to at least reset both sockets, or even request general access to these PDUs if possible.
I also want to explore whether the ASM is misbehaving due to some weird network bug in the "management network" for the HMC, because we observe general stability problems in this (rather simple) network.
Updated by openqa_review 3 months ago
- Due date set to 2024-09-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Related to action #165782: [openQA][infra][ipxe][uefi][initrd] UEFI iPXE Machine fails to load initrd size:S added
Updated by nicksinger 3 months ago
- Status changed from In Progress to Feedback
I completely removed the machine from the HMC now. After doing a discovery scan I found a couple of IPs, among them an ASM at 10.255.255.190 which is redcurrant, but a different one than mentioned in https://progress.opensuse.org/issues/163529#note-24. Adding this one allowed for a super stable HMC connection again, so we might have had a duplicate connection again without the HMC showing the typical behavior (maybe this was "fixed"/changed by a minor HMC update I conducted some months ago?). Anyhow, I enabled redcurrant-1 and tried network boot with the following parameters in SMS:
TFTP BOOT ---------------------------------------------------
Server IP.....................10.168.192.10
Client IP.....................10.145.10.222
Gateway IP....................10.145.10.254
Subnet Mask...................255.255.255.0
( 1 ) Filename.................ppc64le/grub2
TFTP Retries..................5
FINAL PACKET COUNT = 365 .....512 PACKET COUNT = 100
I cloned a recent job to validate how the LPAR behaves now:
# openqa-clone-job --skip-download --skip-chained-deps --within-instance https://openqa.suse.de 15403013 WORKER_CLASS=redcurrant-1 _GROUP=0 BUILD+=poo#163529
1 job has been created:
- sle-15-SP7-Online-ppc64le-Build18.1-RAID0_test@ppc64le-hmc-4disk -> https://openqa.suse.de/tests/15427206
Currently the worker slot is waiting for a lower load on worker29 (which is controlling redcurrant-1).
Updated by okurz 3 months ago
- Due date deleted (2024-09-27)
- Status changed from Feedback to Resolved
https://openqa.suse.de/tests/15427206 booted fine and all worker slots are back in production. There are no more recent production jobs, so if anyone encounters problems let us know.
Updated by openqa_review 3 months ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: autoyast_create_hdd_textmode_qesec
https://openqa.suse.de/tests/15533637#step/bootloader_start/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by okurz 3 months ago
- Status changed from Feedback to In Progress
- Assignee changed from nicksinger to okurz
I did openqa-query-for-job-label poo#163529
will check
15537823|2024-09-26 09:59:10|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537542|2024-09-26 09:19:25|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537463|2024-09-26 08:39:43|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537350|2024-09-26 08:00:32|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537247|2024-09-26 07:20:22|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537112|2024-09-26 06:41:19|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15536904|2024-09-26 06:00:30|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15536404|2024-09-26 05:00:49|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15533650|2024-09-26 02:43:17|done|failed|fips_install_lvm_full_encrypt||worker29
15533637|2024-09-26 02:16:57|done|failed|autoyast_create_hdd_textmode_qesec||worker29
Updated by okurz 3 months ago
- Status changed from In Progress to Resolved
- Assignee changed from okurz to nicksinger
On the first I added a comment "https://openqa.suse.de/tests/15537823#step/bootloader_start/29 shows that we ended up within a properly booted Linux system but fail later in the installer due to not reachable/readable installation repositories. This is not label:poo#163529 anymore" and removed the wrongly taken-over bugref. Equivalent for the other jobs.
Updated by openqa_review 2 months ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: autoyast_create_hdd_gnome_qesec
https://openqa.suse.de/tests/15238123#step/bootloader_start/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by openqa_review about 2 months ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: create_hdd_hmc_fips_ker_mode
https://openqa.suse.de/tests/15809850#step/bootloader/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by nicksinger about 2 months ago
- Status changed from Feedback to Resolved
Ran
workstation git/scripts ‹master› » ./openqa-query-for-job-label poo#163529
15809858|2024-10-30 10:57:32|done|failed|fips_install_lvm_full_encrypt||worker29
15809850|2024-10-30 10:38:56|done|failed|create_hdd_hmc_fips_ker_mode||worker29
and removed wrong references again.
Updated by openqa_review about 1 month ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: create_hdd_hmc_fips_env_mode
https://openqa.suse.de/tests/15927658#step/bootloader/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by nicksinger 26 days ago
- Status changed from Feedback to Resolved
openqa_review wrote in #note-36:
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: create_hdd_hmc_fips_env_mode
https://openqa.suse.de/tests/15927658#step/bootloader/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
removed wrong references