action #163529

open

redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S

Added by tjyrinki_suse 18 days ago. Updated 5 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-07-09
Due date:
% Done: 0%
Estimated time:
Description

https://progress.opensuse.org/issues/163529

Observation

pvm_hmc workers have lost OS / grub access, so no tests can be run at the moment.

Comparing runs of the same job, the regression happened between June 17th (https://openqa.suse.de/tests/14640503#step/bootloader/21) and July 2nd (https://openqa.suse.de/tests/14860440#step/bootloader/15). This can be narrowed down further: it happened after June 25th (https://openqa.suse.de/tests/14734966).

Suggestions

Actions #1

Updated by okurz 18 days ago

  • Tags changed from ppc64le to ppc64le, infra, reactive work
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by nicksinger 18 days ago

  • Subject changed from ppc64le pvm_hmc backend workers lost OS / grub access to redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
  • Description updated (diff)
Actions #3

Updated by nicksinger 18 days ago

  • Description updated (diff)
Actions #4

Updated by nicksinger 18 days ago

  • Subject changed from redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
  • Status changed from New to In Progress
  • Assignee set to nicksinger

The first failing job I found was https://openqa.suse.de/tests/14779007#step/bootloader_start/20 from July 1st. I will check which server IP is configured in petitboot on the LPARs. I am not sure what caused it to change. The new, wrong IP points to "bare-metal4-ipmi" (https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L85-88), which might be misbehaving in the network.

Actions #5

Updated by nicksinger 18 days ago

So petitboot did indeed contain the wrong IP. But even after changing it manually, it fails to chainload the network grub:

Interpartition Logical LAN: U9008.22L.788201A-V3-C3-T1
 1.   Client IP Address                    [10.145.10.222]
 2.   Server IP Address                    [10.168.192.10]
 3.   Gateway IP Address                   [10.145.10.254]
 4.   Subnet Mask                          [255.255.255.0]

redcurrant-1 can reach the PXE server:

10.145.10.222:    24  bytes from 10.168.192.10:  icmp_seq=10  ttl=? time=21  ms

                              .-----------------.
                              |  Ping  Success. |
                              `-----------------'

 Press any key to continue..........

but fails to load it:

BOOTP Parameters:
----------------
chosen-network-type = ethernet,auto,rj45,auto
server IP           = 10.168.192.10
client IP           = 10.145.10.222
gateway IP          = 10.145.10.254
device              = /vdevice/l-lan@30000003
MAC address         = f6 6b 46 d3 fd 03
loc-code            = U9008.22L.788201A-V3-C3-T1

BOOTP request retry attempt: 1
BOOTP request retry attempt: 2
BOOTP request retry attempt: 3
BOOTP request retry attempt: 4
    !BA01B015 !

I am cross-checking with worker31 to figure out whether the DHCP server behaves correctly. It looks like the "dhcp_filename" never shows up (I think I have seen that in petitboot before).

Actions #6

Updated by nicksinger 18 days ago

I can see the corresponding DHCP answers and can also fetch "ppc64le/grub2" manually with tftp from "10.168.192.10" (qa-jump) via worker31, so something else is off on the Power side. I am checking whether I can tftp files from within a petitboot shell.

Actions #7

Updated by nicksinger 18 days ago

With the LPAR booted into a SUT environment, I was able to log in and test from there:

susetest:~ # tftp
tftp> connect 10.168.192.10
tftp> get ppc64le/grub2
Transfer timed out.

The server on the other side shows the following in the logs of tftpd.service:

Jul 09 11:55:28 qa-jump systemd[1]: Started Tftp Server.
Jul 09 11:55:28 qa-jump in.tftpd[5799]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:28 qa-jump in.tftpd[5799]: tftpd: read(ack): No route to host
Jul 09 11:55:33 qa-jump in.tftpd[5804]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:33 qa-jump in.tftpd[5804]: tftpd: read(ack): No route to host
Jul 09 11:55:38 qa-jump in.tftpd[5805]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:38 qa-jump in.tftpd[5805]: tftpd: read(ack): No route to host
Jul 09 11:55:43 qa-jump in.tftpd[5806]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:43 qa-jump in.tftpd[5806]: tftpd: read(ack): No route to host
Jul 09 11:55:48 qa-jump in.tftpd[5856]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:48 qa-jump in.tftpd[5856]: tftpd: read(ack): No route to host

A ping from qa-jump to redcurrant-1 works:

qa-jump:~ # ping 10.145.10.222
PING 10.145.10.222 (10.145.10.222) 56(84) bytes of data.
64 bytes from 10.145.10.222: icmp_seq=1 ttl=62 time=4.71 ms
64 bytes from 10.145.10.222: icmp_seq=2 ttl=62 time=4.73 ms
^C
--- 10.145.10.222 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 4.708/4.718/4.729/0.010 ms

So I'm not sure where this routing error comes from.

Actions #8

Updated by livdywan 18 days ago

  • Subject changed from redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S
  • Description updated (diff)
Actions #9

Updated by nicksinger 18 days ago

Found the following on qa-jump:

qa-jump:~ # tcpdump host 10.145.10.222
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:58:37.920253 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:37.922344 IP qa-jump.qe.nue2.suse.org.35379 > 10.145.10.222.35954: UDP, length 516
11:58:37.927188 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:42.920297 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:42.922035 IP qa-jump.qe.nue2.suse.org.43770 > 10.145.10.222.35954: UDP, length 516
11:58:42.926925 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:47.920210 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:47.922127 IP qa-jump.qe.nue2.suse.org.47782 > 10.145.10.222.35954: UDP, length 516
11:58:47.927022 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:52.920088 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:52.921962 IP qa-jump.qe.nue2.suse.org.57527 > 10.145.10.222.35954: UDP, length 516
11:58:52.926861 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:57.919834 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:57.922274 IP qa-jump.qe.nue2.suse.org.60022 > 10.145.10.222.35954: UDP, length 516
11:58:57.927131 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552

I will check whether I can spot something on the firewall using the instructions in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open an SD ticket if necessary.
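
For reference, the `length 25` and `length 516` values in the capture above match plain TFTP framing (RFC 1350): a read request is a 2-byte opcode followed by the NUL-terminated filename and mode, and a full data block is a 4-byte header plus 512 bytes of payload. The following is only a minimal sketch in Python; the helper names `build_rrq` and `parse_data` are made up for illustration and are not part of any tool used in this ticket.

```python
import struct

OP_RRQ = 1   # read request opcode (RFC 1350)
OP_DATA = 3  # data block opcode (RFC 1350)


def build_rrq(filename: str, mode: str = "netascii") -> bytes:
    """Build a TFTP read request: opcode, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )


def parse_data(packet: bytes) -> tuple[int, bytes]:
    """Split a TFTP DATA packet into (block number, payload)."""
    opcode, block = struct.unpack("!HH", packet[:4])
    if opcode != OP_DATA:
        raise ValueError(f"not a DATA packet (opcode {opcode})")
    return block, packet[4:]


if __name__ == "__main__":
    rrq = build_rrq("ppc64le/grub2")
    print(len(rrq))  # 25, the same length as the RRQ in the capture
```

Read this way, each 516-byte UDP packet in the capture would be a first full data block leaving qa-jump, and the ICMP rejection that follows is sent from 10.145.10.222 itself.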

Actions #10

Updated by openqa_review 17 days ago

  • Due date set to 2024-07-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger 12 days ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-9:

I will check whether I can spot something on the firewall using the instructions in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open an SD ticket if necessary.

I couldn't find anything relevant in the linked firewall log, but I'm not even sure which firewall it shows. I collected many more details and filed https://sd.suse.com/servicedesk/customer/portal/1/SD-162395

Basically I can currently say:

  • .qe.prg2.suse.org (10.145.10.0/24) is affected
    • worker31 (10.145.10.4) is not affected
  • .oqa.prg2.suse.org (10.145.0.0/21) is affected - this is mainly our PPC network range
  • IPv4 and IPv6 behave mostly the same
  • Could be related to the location PRG2-J11 but having worker31 unaffected seems to rule it out
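
A quick way to cross-check which of the ranges above contains a given host is Python's `ipaddress` module. This is only a sketch using the addresses quoted in this ticket; the function name `matching_ranges` is an assumption for illustration.

```python
import ipaddress

# Ranges exactly as written in the list above
RANGES = {
    ".qe.prg2.suse.org": ipaddress.ip_network("10.145.10.0/24"),
    ".oqa.prg2.suse.org": ipaddress.ip_network("10.145.0.0/21"),
}


def matching_ranges(address: str) -> list[str]:
    """Return the names of all ranges that contain the given address."""
    ip = ipaddress.ip_address(address)
    return [name for name, net in RANGES.items() if ip in net]


if __name__ == "__main__":
    hosts = [("worker31", "10.145.10.4"), ("redcurrant-1", "10.145.10.222")]
    for host, addr in hosts:
        print(host, addr, matching_ranges(addr))
```

With the ranges as written, both hosts fall into the /24 only, since 10.145.0.0/21 spans 10.145.0.0 to 10.145.7.255.
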
Actions #12

Updated by nicksinger 12 days ago

  • Status changed from Feedback to Blocked
Actions #13

Updated by nicksinger 11 days ago

  • Status changed from Blocked to In Progress

While collecting more evidence and traces for https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 I realized that a firewall on the other hosts I tried was to blame. A firewall was also active on the SUT running on redcurrant-1. With that disabled, even redcurrant-1 was able to connect to qa-jump and download from it. This unfortunately brings me back to square one: the problem only occurs from within the SMS. Next I will try more LPARs on grenache to understand whether this happens at the machine level or only for specific LPARs. A machine-level problem could point to an HMC or VIOS issue. A problem limited to specific LPARs could point to some strange "corruption" of persistent storage (we had something similar in the past).

Actions #14

Updated by nicksinger 10 days ago

  • Status changed from In Progress to Blocked

Made some progress while checking other redcurrant LPARs. I discovered that in every working case I had reached the TFTP server directly (port 69), whereas the SMS menu of the PPC machines fails on requests to port 67. So DHCP is already not working (I guess it is required for the filename? Everything else is hard-configured in the SMS menu).
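
To illustrate why the filename could depend on this step: in BOOTP (RFC 951), on which DHCP builds, the client sends its request to UDP port 67 and a reply can carry the boot file name in a fixed 128-byte `file` field. Below is a minimal sketch that packs and unpacks such a 300-byte message in Python. It is not what the SMS firmware runs; the function names and the transaction ID are assumptions for illustration, and the MAC address is the one from the BOOTP parameters earlier in this ticket.

```python
import struct

BOOTP_SERVER_PORT = 67  # requests are sent to this UDP port
# op, htype, hlen, hops, xid, secs, flags, ciaddr, yiaddr, siaddr, giaddr,
# chaddr (16), sname (64), file (128), vend (64) = 300 bytes (RFC 951)
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s64s"


def build_bootrequest(mac: bytes, xid: int) -> bytes:
    """Pack a BOOTREQUEST (op=1) for an Ethernet client with empty addresses."""
    zero_ip = bytes(4)
    return struct.pack(
        BOOTP_FORMAT,
        1, 1, 6, 0,          # op=BOOTREQUEST, htype=Ethernet, hlen=6, hops=0
        xid, 0, 0,           # transaction id, secs, flags/unused
        zero_ip, zero_ip, zero_ip, zero_ip,
        mac, b"", b"", b"",  # struct pads the "s" fields with NUL bytes
    )


def boot_filename(message: bytes) -> str:
    """Extract the NUL-terminated boot file name from a BOOTP message."""
    fields = struct.unpack(BOOTP_FORMAT, message[:300])
    return fields[13].split(b"\x00", 1)[0].decode("ascii")


if __name__ == "__main__":
    request = build_bootrequest(bytes.fromhex("f66b46d3fd03"), xid=0x1234)
    print(len(request), repr(boot_filename(request)))
```

A client that never receives a reply never sees a populated `file` field, which would fit the guess above that only the filename comes from DHCP while the addresses are set in the SMS menu.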

I checked suttner1&2 and indeed see problems there. I added this to https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 and asked for help again.

Actions #15

Updated by nicksinger 9 days ago

@okurz had a chance to speak with Gerhard, who applied the same salt state on both machines. There is no change in behavior though, and redcurrant-1 still cannot fetch its PXE binary.

Actions #16

Updated by nicksinger 5 days ago

Quick ping in the ticket again. Situation unchanged

Actions #17

Updated by okurz 5 days ago

  • Due date deleted (2024-07-24)
  • Priority changed from High to Normal

No quick resolution within https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 is expected.
