Project

General

Profile

Actions

action #126821

closed

[openQA][infra][worker] ppc64 fails to load grub2 completely over tftp/pxe on qanet and PXE load timeout issues size:M

Added by lansuse about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-03-28
Due date:
2023-05-05
% Done:

0%

Estimated time:
Tags:

Description

Observation

spvm worker cannot display grub menu properly like as below:
https://openqa.suse.de/tests/10807175#step/bootloader/18

I noticed grenache-1:3 and grenache-1:7 can bootup normally, but grenache-1:1, grenache-1:2, grenache-1:4, grenache-1:5, grenache-1:6 cannot work.

Steps to reproduce

Expected result

Suggestions

  • DONE Remove affected machines from salt
  • Manually boot machines and crosscheck on TFTP server
  • Check if the problem can be reproduced or not reproduce on both qanet and qa-jump.qe.suse2.nue.org

Rollback steps

  • Add machines back to production in salt and verify in production

Files

grub.png (32.8 KB) grub.png No grub configuration was provided to grub, grub finished, nothing was booted. JERiveraMoya, 2023-04-14 05:30

Related issues 3 (2 open1 closed)

Related to openQA Infrastructure - action #125219: Use qa-power8 for ppc tests in o3 - try the other suggestionsNew2023-03-01

Actions
Related to QA - action #128045: /var on qanet is 100%Resolvedokurz2023-04-20

Actions
Copied to openQA Tests - action #129670: [qe-core] Reset ppc nvram before trying to boot Linux kernel to prevent PXE related file reading timeoutsNew

Actions
Actions #1

Updated by lansuse about 1 year ago

I collected some logs from grenache-1:3, it looks like a ssh issue

[2023-03-30T09:41:31.421356+02:00] [debug] [pid:754318] Wait for SSH on host grenache-4.qa.suse.de (timeout: 120)
[2023-03-30T09:43:31.511924+02:00] [debug] [pid:754318] grenache-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 757340
[2023-03-30T09:43:31.519706+02:00] [debug] [pid:754318] <<< backend::baseclass::start_ssh_serial(hostname="grenache-4.qa.suse.de", username="root", password="SECRET")
[2023-03-30T09:43:31.520774+02:00] [debug] [pid:754318] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="grenache-4.qa.suse.de", username="root")
[2023-03-30T09:45:40.840506+02:00] [debug] [pid:754318] Could not connect to root@grenache-4.qa.suse.de, Retrying after some seconds...

Actions #2

Updated by okurz about 1 year ago

  • Tags set to infra
Actions #3

Updated by JERiveraMoya about 1 year ago

This backend has been perfectly working until this was fixed: https://progress.opensuse.org/issues/124685#note-26
I got answer on that ticket that there is not correlation, but anyway from the same date is happening, in latest build is even less stable, restarting 3 times automatically is not enough to make pass a job.

Actions #4

Updated by JERiveraMoya about 1 year ago

I could check that this was happening (not an exact list):
NOVALINK_LPAR_ID 4,5,6,10 were working fine.
NOVALINK_LPAR_ID 7,8,9 fail consistently.

This is our salt configuration: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls

Adding more information from:
https://sd.suse.com/servicedesk/customer/portal/1/SD-118372

from Michal Suchanek:

The redcurrant machine is a very strange configuration. There is Novalink installed but the management is taken by powerhmc1 (which means that hmc1 can manage the machine but hmc2 or the management partition cannot). The Novalink management partition is not even booted.

From the HMC LPARS redcurrant-1 to redcurrant-12 appear configured and running.

lssyscfg -m redcurrant -r lpar -F lpar_id,name,state
1,vios,Running
2,novalink,Open Firmware
3,redcurrant-1,Running
4,redcurrant-2,Running
5,redcurrant-3,Running
6,redcurrant-4,Running
7,redcurrant-10,Running
8,redcurrant-5,Running
9,redcurrant-12,Running
10,redcurrant-11,Running
11,redcurrant-6,Running
12,redcurrant-8,Running
13,redcurrant-9,Running
14,redcurrant-7,Running
15,Multipath-test,Not Activated
16,VIOS2,Not Activated

No grub configuration was provided to grub, grub finished, nothing was booted.

Actions #5

Updated by JERiveraMoya about 1 year ago

Actions #6

Updated by JERiveraMoya about 1 year ago

MR to disabling failing instance for NOVALINK:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/520

Actions #7

Updated by nicksinger about 1 year ago

  • Subject changed from [openQA][infra][worker]spvm backend worker boot failed to [openQA][infra][worker] ppc64 fails to load grub2 completely over tftp/pxe on qanet

I think we see similar issues on non-novalink machines: https://openqa.suse.de/tests/10924663#step/bootloader/19

Actions #8

Updated by okurz about 1 year ago

  • Priority changed from Normal to Urgent
Actions #9

Updated by okurz about 1 year ago

  • Subject changed from [openQA][infra][worker] ppc64 fails to load grub2 completely over tftp/pxe on qanet to [openQA][infra][worker] ppc64 fails to load grub2 completely over tftp/pxe on qanet and PXE load timeout issues size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by okurz about 1 year ago

  • Related to action #125219: Use qa-power8 for ppc tests in o3 - try the other suggestions added
Actions #11

Updated by nicksinger about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #12

Updated by okurz about 1 year ago

Actions #13

Updated by nicksinger about 1 year ago

I took grenache-5 to experiment. I realized that the IP config in the PowerPC Firmware war missing and reconfigured it with the following values:

Server IP.....................10.162.0.1
Client IP.....................10.162.6.241
Gateway IP....................10.162.63.254
Subnet Mask...................255.255.192.0

Afterwards I could still observe the same issue and checked with tcpdump that all files are served correctly from qanet which is the case. However, I found https://www.ibm.com/support/pages/could-not-allocate-memory-rtas-grub-or-yaboot-ppcbe and tried the described steps in "Resolving The Problem" to reset the nvram content which indeed worked. I'm now able to boot into our grub PXE menu again. However I have no clue what caused this - the article mentions "Memory fragmentation" but I didn't see the described error messages, just the symptoms where the same.

Actions #14

Updated by nicksinger about 1 year ago

I did the same now for grenache-6 and 7 (not sure why 7 was disabled in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/520 as it was reported as "working")

Actions #15

Updated by openqa_review about 1 year ago

  • Due date set to 2023-05-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by acarvajal about 1 year ago

I think we're seeing the same with the nessberry LPARs: https://openqa.suse.de/tests/10957776.

It does seem to work sporadically; from the latest 10 runs https://openqa.suse.de/tests/10957776#next_previous over the past 2 weeks, it has worked 3 times.

BTW, is tftpd active in qanet? I see it configured in qanet:/etc/sysconfig/tftp, but there's no service running (that I could see) and it's also not configured as part of xinetd.

Actions #17

Updated by nicksinger about 1 year ago

acarvajal wrote:

I think we're seeing the same with the nessberry LPARs: https://openqa.suse.de/tests/10957776.

It does seem to work sporadically; from the latest 10 runs https://openqa.suse.de/tests/10957776#next_previous over the past 2 weeks, it has worked 3 times.

Not sure if it is actually the same. The symptoms look a little bit different. But you could also try a nvram reset described in https://www.ibm.com/support/pages/could-not-allocate-memory-rtas-grub-or-yaboot-ppcbe - afterwards make sure to configure TFTP/BOOTP in the SMS menu again properly. I also saw some settings regarding block-size in there - you could try to set it from 512 to 1024, it might help loading the required files in a more stable manner.

I have yet to find out why this happens. Maybe this is a firmware issue from PPC machines or maybe it is caused by many re-installs from our side.

Actions #18

Updated by nicksinger about 1 year ago

nicksinger wrote:

acarvajal wrote:

I think we're seeing the same with the nessberry LPARs: https://openqa.suse.de/tests/10957776.

It does seem to work sporadically; from the latest 10 runs https://openqa.suse.de/tests/10957776#next_previous over the past 2 weeks, it has worked 3 times.

Not sure if it is actually the same. The symptoms look a little bit different. But you could also try a nvram reset described in https://www.ibm.com/support/pages/could-not-allocate-memory-rtas-grub-or-yaboot-ppcbe - afterwards make sure to configure TFTP/BOOTP in the SMS menu again properly. I also saw some settings regarding block-size in there - you could try to set it from 512 to 1024, it might help loading the required files in a more stable manner.

I have yet to find out why this happens. Maybe this is a firmware issue from PPC machines or maybe it is caused by many re-installs from our side.

never mind these steps. I realized that tftpd was down on qanet. I restarted this service and the job and it looks better now: https://openqa.suse.de/tests/10958779 - could you confirm @acarvajal ?

Actions #19

Updated by acarvajal about 1 year ago

nicksinger wrote:

never mind these steps. I realized that tftpd was down on qanet. I restarted this service and the job and it looks better now: https://openqa.suse.de/tests/10958779 - could you confirm @acarvajal ?

Just restarted the test. I'll check later how it goes. Thanks for the help.

Actions #20

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Feedback

So I focused on 4, 5 and 6 now as these where the only ones I was able to spot reproducible problems. All three instances can boot again as expected:

  1. https://openqa.suse.de/tests/10968507
  2. https://openqa.suse.de/tests/10968508
  3. https://openqa.suse.de/tests/10968509

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/526 to revert the previous disable. I think after it is merged and executing a "proper" job, we can close this here. @JERiveraMoya what do you think?

Actions #21

Updated by nicksinger 12 months ago

  • Status changed from Feedback to Resolved

Workers look fine also after the long weekend. I'm considering this done for now.

Actions #22

Updated by openqa_review 11 months ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: default@ppc64le-hmc
https://openqa.suse.de/tests/11162714#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #23

Updated by nicksinger 11 months ago

  • Status changed from Feedback to Resolved

different issue in the meantime. Removed bugref and restarted job.

Actions #24

Updated by okurz 11 months ago

  • Copied to action #129670: [qe-core] Reset ppc nvram before trying to boot Linux kernel to prevent PXE related file reading timeouts added
Actions

Also available in: Atom PDF