action #163529

open

redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S

Added by tjyrinki_suse 5 months ago. Updated 1 day ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-07-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

pvm_hmc workers have lost OS / grub access, so no tests can be run at the moment.

Looking at the same job, it happened between June 17th (https://openqa.suse.de/tests/14640503#step/bootloader/21) and July 2nd (https://openqa.suse.de/tests/14860440#step/bootloader/15). This can be further narrowed down to after June 25th (https://openqa.suse.de/tests/14734966).

Suggestions


Related issues: 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #165782: [openQA][infra][ipxe][uefi][initrd] UEFI iPXE Machine fails to load initrd size:S (New, 2024-08-26)

Actions
Actions #1

Updated by okurz 5 months ago

  • Tags changed from ppc64le to ppc64le, infra, reactive work
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by nicksinger 5 months ago

  • Subject changed from ppc64le pvm_hmc backend workers lost OS / grub access to redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
  • Description updated (diff)
Actions #3

Updated by nicksinger 5 months ago

  • Description updated (diff)
Actions #4

Updated by nicksinger 5 months ago

  • Subject changed from redcurrant(-3?) is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access)
  • Status changed from New to In Progress
  • Assignee set to nicksinger

The first affected job I found was https://openqa.suse.de/tests/14779007#step/bootloader_start/20 from July 1st. I will check the LPARs to see which server IP is configured in petitboot. Not sure what caused it to change. The new/wrong IP points to "bare-metal4-ipmi" (https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml#L85-88), which might be misbehaving in the network.
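
If helpful, the stored settings could also be inspected from a shell on the LPAR; a minimal sketch, assuming petitboot is in use and the powerpc-utils nvram tool is available (the variable name is an assumption):

# Dump petitboot's configuration variables from the NVRAM "common" partition
nvram --print-config
# Show only the static network configuration, if one is set (name is an assumption)
nvram --print-config=petitboot,network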

Actions #5

Updated by nicksinger 5 months ago

So indeed petitboot contained the wrong IP. But even after changing it manually, the LPAR still fails to chainload the network GRUB:

Interpartition Logical LAN: U9008.22L.788201A-V3-C3-T1
 1.   Client IP Address                    [10.145.10.222]
 2.   Server IP Address                    [10.168.192.10]
 3.   Gateway IP Address                   [10.145.10.254]
 4.   Subnet Mask                          [255.255.255.0]

redcurrant-1 can reach the PXE server:

10.145.10.222:    24  bytes from 10.168.192.10:  icmp_seq=10  ttl=? time=21  ms

                              .-----------------.
                              |  Ping  Success. |
                              `-----------------'

 Press any key to continue..........

but fails to load it:

BOOTP Parameters:
----------------
chosen-network-type = ethernet,auto,rj45,auto
server IP           = 10.168.192.10
client IP           = 10.145.10.222
gateway IP          = 10.145.10.254
device              = /vdevice/l-lan@30000003
MAC address         = f6 6b 46 d3 fd 03
loc-code            = U9008.22L.788201A-V3-C3-T1

BOOTP request retry attempt: 1
BOOTP request retry attempt: 2
BOOTP request retry attempt: 3
BOOTP request retry attempt: 4
    !BA01B015 !

Cross-checking with worker31 to figure out whether the DHCP server behaves correctly. It looks like the "dhcp_filename" never shows up (I think I already saw that in petitboot in the past).
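
A sketch of how that cross-check on worker31 could look, assuming tcpdump is available (the interface name is an assumption); the verbose output should show whether the BOOTP/DHCP reply carries a boot file name at all:

# Capture DHCP/BOOTP traffic and inspect the reply for the "file" field
tcpdump -vni eth0 'udp and (port 67 or port 68)'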

Actions #6

Updated by nicksinger 5 months ago

I can see the corresponding DHCP answers and can also fetch "ppc64le/grub2" manually with tftp from "10.168.192.10" (qa-jump) via worker31, so something else is off on the Power side. Checking whether I can fetch files with tftp from within a petitboot shell.
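
A sketch of the manual fetch; the first form assumes the busybox tftp applet that petitboot shells usually ship, the second the regular tftp client on a worker:

# From a petitboot shell (busybox tftp: -g get, -r remote file)
tftp -g -r ppc64le/grub2 10.168.192.10
# From worker31 with the regular tftp client
tftp 10.168.192.10 -c get ppc64le/grub2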

Actions #7

Updated by nicksinger 5 months ago

With the LPAR booted up in some SUT environment I was able to log in and test from there:

susetest:~ # tftp
tftp> connect 10.168.192.10
tftp> get ppc64le/grub2
Transfer timed out.

The server on the other side shows the following in the logs of tftpd.service:

Jul 09 11:55:28 qa-jump systemd[1]: Started Tftp Server.
Jul 09 11:55:28 qa-jump in.tftpd[5799]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:28 qa-jump in.tftpd[5799]: tftpd: read(ack): No route to host
Jul 09 11:55:33 qa-jump in.tftpd[5804]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:33 qa-jump in.tftpd[5804]: tftpd: read(ack): No route to host
Jul 09 11:55:38 qa-jump in.tftpd[5805]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:38 qa-jump in.tftpd[5805]: tftpd: read(ack): No route to host
Jul 09 11:55:43 qa-jump in.tftpd[5806]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:43 qa-jump in.tftpd[5806]: tftpd: read(ack): No route to host
Jul 09 11:55:48 qa-jump in.tftpd[5856]: RRQ from 10.145.10.222 filename ppc64le/grub2
Jul 09 11:55:48 qa-jump in.tftpd[5856]: tftpd: read(ack): No route to host

A ping from qa-jump to the LPAR (10.145.10.222) works:

qa-jump:~ # ping 10.145.10.222
PING 10.145.10.222 (10.145.10.222) 56(84) bytes of data.
64 bytes from 10.145.10.222: icmp_seq=1 ttl=62 time=4.71 ms
64 bytes from 10.145.10.222: icmp_seq=2 ttl=62 time=4.73 ms
^C
--- 10.145.10.222 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 4.708/4.718/4.729/0.010 ms

So I'm not sure where this route error comes from.
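
Since read(ack) reporting "No route to host" usually means an ICMP error came back for the data packet, two quick checks on qa-jump might narrow it down (the interface name is an assumption):

# Which route and source address the kernel would use towards the client
ip route get 10.145.10.222
# Watch for ICMP errors arriving instead of the expected TFTP ACKs
tcpdump -ni eth0 'host 10.145.10.222 and (icmp or port 69)'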

Actions #8

Updated by livdywan 5 months ago

  • Subject changed from redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) to redcurrant is unable to boot from PXE server (was: ppc64le pvm_hmc backend workers lost OS / grub access) size:S
  • Description updated (diff)
Actions #9

Updated by nicksinger 5 months ago

Found the following on qa-jump:

qa-jump:~ # tcpdump host 10.145.10.222
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:58:37.920253 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:37.922344 IP qa-jump.qe.nue2.suse.org.35379 > 10.145.10.222.35954: UDP, length 516
11:58:37.927188 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:42.920297 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:42.922035 IP qa-jump.qe.nue2.suse.org.43770 > 10.145.10.222.35954: UDP, length 516
11:58:42.926925 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:47.920210 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:47.922127 IP qa-jump.qe.nue2.suse.org.47782 > 10.145.10.222.35954: UDP, length 516
11:58:47.927022 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:52.920088 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:52.921962 IP qa-jump.qe.nue2.suse.org.57527 > 10.145.10.222.35954: UDP, length 516
11:58:52.926861 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552
11:58:57.919834 IP 10.145.10.222.35954 > qa-jump.qe.nue2.suse.org.tftp: TFTP, length 25, RRQ "ppc64le/grub2" netascii
11:58:57.922274 IP qa-jump.qe.nue2.suse.org.60022 > 10.145.10.222.35954: UDP, length 516
11:58:57.927131 IP 10.145.10.222 > qa-jump.qe.nue2.suse.org: ICMP host 10.145.10.222 unreachable - admin prohibited filter, length 552

I will check whether I can spot something on the firewall with the instructions provided in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open an SD ticket if necessary.
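
"admin prohibited" ICMP errors can also be generated by a firewall on the client itself, so a check on the SUT might be worthwhile as well; a sketch assuming firewalld (or plain iptables) is in use:

# Is firewalld running, and what does the active zone reject?
firewall-cmd --state
firewall-cmd --list-all
# For a plain netfilter setup, look for REJECT rules with admin-prohibited
iptables -L -n -v | grep -i -e reject -e prohibited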

Actions #10

Updated by openqa_review 4 months ago

  • Due date set to 2024-07-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger 4 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-9:

I will try if I can spot something on the firewall with instructions provided in https://wiki.suse.net/index.php/OpenQA#Firewall_between_different_SUSE_network_zones and open a SD ticket if necessary.

Couldn't find anything relevant in the linked firewall log, but I'm not even sure which firewall it shows. I collected a lot more details and filed https://sd.suse.com/servicedesk/customer/portal/1/SD-162395

Basically I can currently say:

  • .qe.prg2.suse.org (10.145.10.0/24) is affected
    • worker31 (10.145.10.4) is not affected
  • .oqa.prg2.suse.org (10.145.0.0/21) is affected - this is mainly our PPC network range
  • IPv4 and IPv6 behave mostly the same
  • Could be related to the location PRG2-J11, but worker31 being unaffected seems to rule that out
Actions #12

Updated by nicksinger 4 months ago

  • Status changed from Feedback to Blocked
Actions #13

Updated by nicksinger 4 months ago

  • Status changed from Blocked to In Progress

While collecting more evidence and traces for https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 I realized that on the other hosts I tried, the firewall was to blame. On the SUT running on redcurrant-1 a firewall was active as well. With that disabled, even redcurrant-1 was able to connect to and download from qa-jump. This unfortunately brings me back to square one: the problem only occurs from within the SMS. Next I will try out more LPARs on grenache to understand whether this happens on a machine level or only for specific LPARs. A machine-level issue could point to an HMC or VIOS problem; an issue on specific LPARs only could point to some strange "corruption" of persistent storage (we had something similar in the past already).
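
For reference, the retest on the SUT roughly amounted to the following; the service name assumes firewalld:

# Temporarily stop the local firewall, retry the TFTP download, re-enable afterwards
systemctl stop firewalld
tftp 10.168.192.10 -c get ppc64le/grub2
systemctl start firewalld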

Actions #14

Updated by nicksinger 4 months ago

  • Status changed from In Progress to Blocked

Made some progress while checking other redcurrant LPARs. I discovered that in every working case I had been reaching the tftp server directly (port 69), but the SMS menu of the PPC machines fails with its requests on port 67. So DHCP/BOOTP is already not working (I guess it is required for the filename? Everything else is hard-configured in the SMS menu).

I checked suttner1&2 and indeed see problems there. I added this to https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 and asked for help again.
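
A sketch of how the port-67 side could be verified on qa-jump, assuming the firmware's BOOTP request is routed there at all (the interface name is an assumption):

# Does any BOOTP/DHCP request from the PPC network range reach qa-jump, and is it answered?
tcpdump -vni eth0 '(udp port 67 or udp port 68) and net 10.145.10.0/24'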

Actions #15

Updated by nicksinger 4 months ago

@okurz had a chance to speak with Gerhard, who applied the same salt state on both machines. No change in behavior though, and redcurrant-1 still cannot fetch its PXE binary.
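
Presumably "applying the same salt state" amounts to something like the following; the target is an assumption and only illustrates the idea:

# Apply the configured highstate to the PXE/DHCP host
salt 'qa-jump*' state.apply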

Actions #16

Updated by nicksinger 4 months ago

Quick ping in the ticket again; situation unchanged.

Actions #17

Updated by okurz 4 months ago

  • Due date deleted (2024-07-24)
  • Priority changed from High to Normal

No quick resolution within https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 is expected.

Actions #18

Updated by livdywan 4 months ago

okurz wrote in #note-17:

No quick resolution within https://sd.suse.com/servicedesk/customer/portal/1/SD-162395 is expected.

No response so far.

Actions #19

Updated by nicksinger 3 months ago

  • Status changed from Blocked to In Progress

Both suttners should be fixed again. Going to check if anything changed with this machine.

Actions #20

Updated by openqa_review 3 months ago

  • Due date set to 2024-09-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by livdywan 3 months ago

  • Description updated (diff)
Actions #22

Updated by livdywan 3 months ago

  • Due date deleted (2024-09-07)
  • Status changed from In Progress to Workable

I'm guessing this is not in progress right now in favor of more urgent tasks.

Actions #23

Updated by nicksinger 3 months ago

After both suttners had been fixed I was not able to power on the machine via the HMC, so I could not verify the fix from https://sd.suse.com/servicedesk/customer/portal/1/SD-162395
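
For reference, the power-on attempt can also be made from the HMC command line; a sketch where the managed-system, LPAR and profile names are assumptions:

# List the LPARs on the managed system and their state
lssyscfg -r lpar -m redcurrant -F name,state
# Try to activate the LPAR with its default profile
chsysstate -r lpar -m redcurrant -o on -n redcurrant-1 -f default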

Actions #24

Updated by nicksinger 2 months ago

  • Status changed from Workable to In Progress

The situation with the machine got worse with every interaction I tried. First the LPAR didn't start (timeout in the HMC web dialogue). A reset of the ASM connection didn't change anything. Rebooting the VIOS didn't work because the HMC timed out again. After receiving a report in Slack that the machine is now offline, I tried to restart the whole hypervisor, which again resulted in timeouts while trying to start the VIOS. I then tried to restart the ASM, which now results in a completely unstable ASM connection:

From 10.255.255.1 icmp_seq=6742 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6759 ttl=64 time=0.263 ms
64 bytes from 10.255.255.153: icmp_seq=6774 ttl=64 time=0.387 ms
64 bytes from 10.255.255.153: icmp_seq=6789 ttl=64 time=0.238 ms
64 bytes from 10.255.255.153: icmp_seq=6790 ttl=64 time=0.282 ms
64 bytes from 10.255.255.153: icmp_seq=6791 ttl=64 time=0.256 ms
64 bytes from 10.255.255.153: icmp_seq=6805 ttl=64 time=0.238 ms
64 bytes from 10.255.255.153: icmp_seq=6806 ttl=64 time=0.284 ms
From 10.255.255.1 icmp_seq=6818 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6819 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6820 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6821 ttl=64 time=1024 ms
64 bytes from 10.255.255.153: icmp_seq=6822 ttl=64 time=0.363 ms
64 bytes from 10.255.255.153: icmp_seq=6836 ttl=64 time=0.284 ms
64 bytes from 10.255.255.153: icmp_seq=6837 ttl=64 time=0.300 ms
From 10.255.255.1 icmp_seq=6849 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6850 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6851 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6854 ttl=64 time=0.238 ms
From 10.255.255.1 icmp_seq=6880 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6881 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6882 Destination Host Unreachable
64 bytes from 10.255.255.153: icmp_seq=6899 ttl=64 time=0.326 ms
From 10.255.255.1 icmp_seq=6910 Destination Host Unreachable
From 10.255.255.1 icmp_seq=6911 Destination Host Unreachable

According to racktables the machine is connected to PDU-A-J11 and PDU-B-J11, which (again according to racktables) are both in the .mgmt.prg2.suse.org domain and both unreachable for me. I'm considering opening an SD ticket to at least reset both sockets, or even to request general access to these PDUs if possible.

I also want to explore the possibility that the ASM is misbehaving due to some weird network bug in the "management network" for the HMC, because we observe general stability problems in this (rather simple) network.

Actions #25

Updated by openqa_review 2 months ago

  • Due date set to 2024-09-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #26

Updated by okurz 2 months ago

  • Related to action #165782: [openQA][infra][ipxe][uefi][initrd] UEFI iPXE Machine fails to load initrd size:S added
Actions #27

Updated by nicksinger 2 months ago

  • Status changed from In Progress to Feedback

I have now completely removed the machine from the HMC. After doing a discovery scan I found a couple of IPs, among them an ASM at 10.255.255.190, which is redcurrant but a different address than the one mentioned in https://progress.opensuse.org/issues/163529#note-24. Adding this one allowed for a super stable HMC connection again, so we might have had a duplicate connection again without the HMC showing the typical symptoms (maybe this was "fixed"/changed by a minor HMC update I conducted some months ago?). Anyhow, I enabled redcurrant-1 and tried a network boot with the following parameters in SMS:

TFTP BOOT ---------------------------------------------------
Server IP.....................10.168.192.10
Client IP.....................10.145.10.222
Gateway IP....................10.145.10.254
Subnet Mask...................255.255.255.0
( 1  ) Filename.................ppc64le/grub2
TFTP Retries..................5
FINAL PACKET COUNT = 365 .....512 PACKET COUNT = 100

I cloned a recent job to validate how the LPAR behaves now:

# openqa-clone-job --skip-download --skip-chained-deps --within-instance https://openqa.suse.de 15403013 WORKER_CLASS=redcurrant-1 _GROUP=0 BUILD+=poo#163529
1 job has been created:
 - sle-15-SP7-Online-ppc64le-Build18.1-RAID0_test@ppc64le-hmc-4disk -> https://openqa.suse.de/tests/15427206

Currently the worker slot is waiting for a lower load on worker29 (which is controlling redcurrant-1).
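
One way to keep an eye on the slot while waiting, assuming openqa-cli is configured with OSD credentials (the jq filter only relies on the workers route returning a "workers" array with a "host" field):

# Show the state of all worker slots on worker29
openqa-cli api --osd workers | jq '[.workers[] | select(.host == "worker29")]'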

Actions #28

Updated by okurz 2 months ago

  • Due date deleted (2024-09-27)
  • Status changed from Feedback to Resolved

https://openqa.suse.de/tests/15427206 booted fine and all worker slots are back in production. There are no more recent production jobs yet, so if anyone encounters problems let us know.

Actions #29

Updated by openqa_review about 2 months ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_create_hdd_textmode_qesec
https://openqa.suse.de/tests/15533637#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #30

Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from nicksinger to okurz

I ran openqa-query-for-job-label poo#163529 and will check the results:

15537823|2024-09-26 09:59:10|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537542|2024-09-26 09:19:25|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537463|2024-09-26 08:39:43|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537350|2024-09-26 08:00:32|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537247|2024-09-26 07:20:22|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15537112|2024-09-26 06:41:19|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15536904|2024-09-26 06:00:30|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15536404|2024-09-26 05:00:49|done|failed|create_hdd_textmode_hmc_ntlm||worker29
15533650|2024-09-26 02:43:17|done|failed|fips_install_lvm_full_encrypt||worker29
15533637|2024-09-26 02:16:57|done|failed|autoyast_create_hdd_textmode_qesec||worker29
Actions #31

Updated by okurz about 2 months ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to nicksinger

On the first job I added the comment "https://openqa.suse.de/tests/15537823#step/bootloader_start/29 shows that we ended up within a properly booted Linux system but fail later in the installer due to not reachable/readable installation repositories. This is not label:poo#163529 anymore" and removed the wrongly carried-over bugref. Did the equivalent for the other jobs.
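
For reference, such comments can also be added over the API instead of the web UI; a sketch assuming OSD credentials are configured for openqa-cli:

# Add a comment to one of the wrongly labelled jobs
openqa-cli api --osd -X POST jobs/15537823/comments \
    text='This is not label:poo#163529 anymore'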

Actions #32

Updated by openqa_review about 1 month ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: autoyast_create_hdd_gnome_qesec
https://openqa.suse.de/tests/15238123#step/bootloader_start/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #33

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Resolved

Wrong reference removed.

Actions #34

Updated by openqa_review 17 days ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_hmc_fips_ker_mode
https://openqa.suse.de/tests/15809850#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #35

Updated by nicksinger 16 days ago

  • Status changed from Feedback to Resolved

Ran

workstation git/scripts ‹master› » ./openqa-query-for-job-label poo#163529
15809858|2024-10-30 10:57:32|done|failed|fips_install_lvm_full_encrypt||worker29
15809850|2024-10-30 10:38:56|done|failed|create_hdd_hmc_fips_ker_mode||worker29

and removed wrong references again.

Actions #36

Updated by openqa_review 1 day ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_hmc_fips_env_mode
https://openqa.suse.de/tests/15927658#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
