action #116078

coordination #118642: [epic] PPC testing capabilities in openqa.opensuse.org

Recover o3 worker power8, restore IPMI access size:M

Added by mkittler 3 months ago. Updated 3 days ago.

Status: Feedback
Priority: High
Assignee:
Target version:
Start date: 2022-08-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

After a firmware update, IPMI access broke. The machine booted once successfully but is now stuck on boot, and we need IPMI access to recover it.

Acceptance criteria

  • AC1: IPMI access has been restored
  • AC2: ppc64le qemu openQA jobs on openqa.opensuse.org are executed again as expected

Context

Suggestions

  • Follow suggestions from https://bugzilla.opensuse.org/show_bug.cgi?id=1202275 and https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html
  • Write an email, raise an alarm, state that we don't have any ppc tests running for openSUSE right now AT ALL!!!1 (factory first!!!) We need the help of someone who can set up PowerPC machines and at best has direct access to SUSE Nbg Maxtorhof SRV1. If there is no one, then escalate and question whether SUSE can support this at all in the future
  • As an alternative, coordinate access to SRV1 for ourselves and try to fix it in place.
  • Boot via recovery media over PXE.
  • Attach a "Virtual CD" which can (IIRC) be attached over the HMC.

Related issues

Related to openQA Infrastructure - action #115094: [tools] test fails in bootloader_start: redcurrant is down size:M (Resolved, 2022-08-08)

History

#1 Updated by cdywan 3 months ago

  • Project changed from openQA Infrastructure to qam-qasle-collaboration
  • Description updated (diff)

#2 Updated by cdywan 3 months ago

  • Subject changed from Recover o3 worker power8, restore IPMI access to Recover o3 worker power8, restore IPMI access size:undecided

We agreed to not estimate it yet but raise it in the sync call first. Currently we can't work on it anyway due to the AC situation

#3 Updated by jstehlik 3 months ago

Take a look at the unused Power8 machines in server room 2, probably rack 1. Matthias will add more details.

#4 Updated by jstehlik 3 months ago

  • Related to action #115094: [tools] test fails in bootloader_start: redcurrant is down size:M added

#6 Updated by cdywan 3 months ago

  • Due date set to 2022-10-28
  • Status changed from New to Blocked
  • Assignee set to cdywan

Blocking on availability of the HMC

#7 Updated by okurz 2 months ago

  • Project changed from qam-qasle-collaboration to openQA Infrastructure
  • Subject changed from Recover o3 worker power8, restore IPMI access size:undecided to Recover o3 worker power8, restore IPMI access
  • Status changed from Blocked to Feedback
  • Priority changed from Normal to High

The ticket was set to "Blocked" but https://sd.suse.com/servicedesk/customer/portal/1/SD-97044 is set to "Closed". As discussed in weekly QE sync with jstehlik and also cdywan we should add it back to the "openQA Infrastructure" ticket project and also not set it as "Blocked" because work from our side is needed.

#8 Updated by cdywan 2 months ago

As discussed I sent an email to research@suse.de to find out what options we have here.

#9 Updated by okurz 2 months ago

  • Subject changed from Recover o3 worker power8, restore IPMI access to Recover o3 worker power8, restore IPMI access size:M
  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (cdywan)

https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html is the email and there is already a response. We estimated the ticket in the weekly QE Tools estimation call 2022-09-15

#10 Updated by mkittler 2 months ago

  • Due date changed from 2022-10-28 to 2022-11-02
  • Status changed from Workable to Feedback
  • Assignee set to nicksinger

#11 Updated by nicksinger about 2 months ago

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

#12 Updated by cdywan about 2 months ago

nicksinger wrote:

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

Hector is suggesting we open a bugzilla with an explanation, versions used and references to issues to be mirrored on the IBM side

#13 Updated by okurz about 2 months ago

  • Parent task set to #118642

#14 Updated by nicksinger about 2 months ago

  • Status changed from Feedback to Blocked

cdywan wrote:

nicksinger wrote:

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

Hector is suggesting we open a bugzilla with an explanation, versions used and references to issues to be mirrored on the IBM side

Not sure what the process is but I created https://bugzilla.opensuse.org/show_bug.cgi?id=1204277 now. Hope this makes its way to IBM somehow.
In general this now raised an upcoming discussion on how we want to proceed with power testing for openSUSE (see https://suse.slack.com/archives/C02CANHLANP/p1665658811414529?thread_ts=1664464326.700249&cid=C02CANHLANP)

#15 Updated by okurz about 1 month ago

  • Status changed from Blocked to In Progress

We logged in over https://powerhmc1.arch.suse.de and clicked on "Connect Systems…" and added the machine openqaworker-power8-ipmi.suse.de with the credentials as in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L55 . The system appeared with an entry for IPv4 address 10.160.1.104 but the message "Version mismatch". Details: "This version of the management console is unable to manage the system. Update the management console Licensed Internal Code. For information on updating Licensed Internal Code, see 'Update HMC' in the Operations Guide for the HMC and Managed Systems, SA76-0085.". nicksinger suggests to use powerhmc3 for which we do not have credentials.

So nicksinger please ask aeisner for credentials or an update of powerhmc1.

#16 Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Blocked

Ticket for a user in powerhmc3 created: https://sd.suse.com/servicedesk/customer/portal/1/SD-101953

#17 Updated by okurz about 1 month ago

  • Priority changed from High to Urgent

This is becoming urgent now, with external users asking for it and more time passing without any ppc64le testing going on. I extended the request in https://sd.suse.com/servicedesk/customer/portal/1/SD-101953

#18 Updated by okurz about 1 month ago

https://sd.suse.com/servicedesk/customer/portal/1/SD-101953 received an update. According to aeisner we have credentials on the system for all current team members. I tried to connect systems qa-power8, qa-power8-3 and the o3 machine power8-ipmi but I can only see two machines, "haldir" and "legolas". I assume that we are members of some "QA-SAP" group, which would not make sense. I put a comment into the ticket and am waiting for aeisner's response.

#19 Updated by okurz about 1 month ago

  • Due date changed from 2022-11-02 to 2022-11-11
  • Assignee changed from nicksinger to okurz

#20 Updated by okurz about 1 month ago

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to nicksinger

As discussed with aeisner, apparently it's not possible to upgrade either the HMC or the internal licenses. We agreed that with the team SUSE QE Tools having access to https://powerhmc3.arch.suse.de/ we can put all machines with newer firmware into there. Now we can try to get power8.openqanet.opensuse.org into one of the HMCs.

#21 Updated by okurz about 1 month ago

  • Assignee changed from nicksinger to okurz

#22 Updated by okurz about 1 month ago

I have added power8.openqanet.opensuse.org as well as qa-power8-3.qa.suse.de in both https://powerhmc1.arch.suse.de/ and https://powerhmc3.arch.suse.de/ . In both cases the machine shows up under the IPv4 address 10.160.1.104 with the error message from #116078#note-15 . Maybe the firmware update explained in https://bugzilla.opensuse.org/show_bug.cgi?id=1202275 was actually newer than the HMC installation of hmc3, installed 2021-12. So, can we update hmc3?

First we tried to change from OPAL mode to PowerVM mode in ASM. Then there was a message about needing a system password, so we set a "new" system password, the one from salt pillars. Then we called the action to reset the system connection. Now the machine was correctly found in hmc3. We powered on the system. In hmc1 we also reset the connection and the system showed up fine. But we could not find where to configure IPMI or anything similar. We could connect to the console of the "system partition" over HMC and ended up in an "SMS menu" where one can select boot entries and such, but we did not find any storage volume there, possibly because that would need to be a virtualized storage container anyway.

So we opted to remove the machine from both HMCs again, but ASM still prevented us from changing to OPAL mode. In ASM one first needs to go to "Hardware Management Consoles" and reset to a non-HMC managed instance, and only then change the firmware mode back to OPAL mode. We still can't connect over IPMI.
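
For reference, a minimal sketch of how a "can we connect over IPMI" check could be scripted from another host. It assumes `ipmitool` is installed locally and that the BMC speaks IPMI v2.0 over LAN (`-I lanplus`). The helper names `remote_ipmi_command` and `ipmi_reachable`, the timeout and the placeholder credentials are illustrative assumptions and not part of any existing tooling; only the host name is taken from the comments above.

```python
import shutil
import subprocess


def remote_ipmi_command(host, user, password, *args):
    """Build an ipmitool argv for an out-of-band (LAN) request.

    Hypothetical helper: -I lanplus selects IPMI v2.0 over LAN,
    -H/-U/-P are host, user and password.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *args]


def ipmi_reachable(host, user, password, timeout=15):
    """Return True if 'chassis power status' succeeds against the BMC."""
    if shutil.which("ipmitool") is None:
        raise RuntimeError("ipmitool is not installed on this host")
    argv = remote_ipmi_command(host, user, password, "chassis", "power", "status")
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # No answer from the BMC within the timeout counts as unreachable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder credentials; the real ones live in the salt pillars.
    print(ipmi_reachable("openqaworker-power8-ipmi.suse.de", "USER", "PASSWORD"))
```

Passing the password with `-P` exposes it in the process list; `ipmitool -E` reads it from the `IPMI_PASSWORD` environment variable instead, which may be preferable on shared hosts.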

#23 Updated by okurz about 1 month ago

  • Due date deleted (2022-11-11)
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

The system does not boot up. Please create a ticket to ask Eng-Infra to connect a monitor and keyboard to the machine and check why it does not boot. Then, as soon as someone can log in, configure IPMI access with a local "ipmitool" invocation.
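
A rough sketch of what such a local "ipmitool" invocation could look like once someone has a root shell on the machine. It only prints the commands so they can be reviewed before being run. It assumes the in-band IPMI device is available to `ipmitool`, that the BMC LAN settings live on channel 1 and that user ID 2 is the account to enable; the function name, channel, user ID, addresses (documentation range 192.0.2.0/24) and password are all placeholders, not values from this ticket.

```python
import shlex


def local_ipmi_setup_commands(ip, netmask, gateway, user_id, password, channel=1):
    """Return ipmitool argv lists that configure BMC LAN access in-band.

    Hypothetical helper; channel and user ID differ between machines,
    so check the output of 'lan print' and 'user list' first.
    """
    ch = str(channel)
    uid = str(user_id)
    return [
        ["ipmitool", "lan", "print", ch],                       # show current settings
        ["ipmitool", "lan", "set", ch, "ipsrc", "static"],      # static instead of DHCP
        ["ipmitool", "lan", "set", ch, "ipaddr", ip],
        ["ipmitool", "lan", "set", ch, "netmask", netmask],
        ["ipmitool", "lan", "set", ch, "defgw", "ipaddr", gateway],
        ["ipmitool", "user", "set", "password", uid, password],
        ["ipmitool", "user", "enable", uid],
        ["ipmitool", "channel", "setaccess", ch, uid,
         "link=on", "ipmi=on", "callin=on", "privilege=4"],     # 4 = administrator
        ["ipmitool", "lan", "set", ch, "access", "on"],
    ]


if __name__ == "__main__":
    commands = local_ipmi_setup_commands(
        "192.0.2.10", "255.255.255.0", "192.0.2.1", user_id=2, password="CHANGE_ME"
    )
    for argv in commands:
        print(shlex.join(argv))
```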

#24 Updated by nicksinger 25 days ago

  • Assignee set to nicksinger

#25 Updated by nicksinger 23 days ago

  • Status changed from Workable to Blocked

Ticket created: https://sd.suse.com/servicedesk/customer/portal/1/SD-103442 . Waiting for answers now. Most likely we need to be there in person with somebody from eng-infra, but let's see how it goes.

#26 Updated by okurz 17 days ago

  • Status changed from Blocked to In Progress

From the ticket:

As we chatted in slack, this weird powermachine has no option (display port vga etc) to direct connect it. sorry for not find a solution so far

I suggest to ask around to more people what we try to achieve, what has been done, what failed and ask what we should do then

#27 Updated by openqa_review 17 days ago

  • Due date set to 2022-11-25

Setting due date based on mean cycle time of SUSE QE Tools

#28 Updated by nicksinger 12 days ago

  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)

Quick recap of the last days: physical installation/recovery cannot be conducted because the machine lacks any graphical output port.
We also came up with the idea to recover/configure IPMI over an LPAR while the machine is managed by the HMC. I connected the machine to powerhmc3 but was not able to boot any recovery media over PXE. The only idea left is to try the same with a "Virtual CD" which can (IIRC) be attached over the HMC.

#29 Updated by cdywan 12 days ago

  • Description updated (diff)

#30 Updated by mkittler 12 days ago

I suggest to ask around to more people what we try to achieve, what has been done, what failed and ask what we should do then

We've already started a mailing list thread a while ago: https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html

The conclusion was to ask Hector: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249 - This discussion switched to Power9 at some point. Hector eventually forwarded some information to IBM but it was apparently not of any help. As far as I can see, we haven't gained any new information on how to restore IPMI access to the machine.

#31 Updated by okurz 11 days ago

Regarding PowerPC management, everyone should be able to login on https://powerhmc3.arch.suse.de/

#32 Updated by okurz 11 days ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

As discussed in the weekly QE sync meeting I will try to escalate the issue and bring in more help. If there is nobody available at the company providing reasonable guidance then we for sure have a problem :D

#33 Updated by okurz 11 days ago

  • Due date deleted (2022-11-25)
  • Status changed from In Progress to Blocked

A draft to a message in #engineering:

Hi, we have repeatedly brought up the topic but I am trying again as we could not find a solution: We have PowerPC machines, Power8 in this example, …

first trying with #119059

#35 Updated by okurz 11 days ago

Helpful pointers from Michal Suchanek:

(Michal Suchanek) doesn't openQA use network boot on Power machines already?
This is an article that shows how to use low-level VIOS commands to create and manage virtual optical media and drives
https://www.ibm.com/support/pages/how-configure-vios-media-repositoryvirtual-media-library-ex-aix-installrestore
It's also possible to configure on HMC https://www.ibm.com/support/pages/creating-virtual-optical-drive-hmc-managed-vios-partition
If you have bare metal machines that cannot be connected to HMC this (likely) applies https://www.ibm.com/docs/en/linux-on-systems?topic=systems-petitboot-bootloader
System can be booted directly from a network server, https://download.suse.de/install/SLP/ should do.
In OPAL mode the last one applies. You either have PowerVM LPARS or OPAL, not both
(Oliver Kurz) Can I see somewhere if the physical machine has a physical network connection besides the one used for management?
(Michal Suchanek) On a HMC-connected machine you would see it on HMC. All machines have some sort of hardware inventory available on the web management interface but the interface varies depending on product line.
(Oliver Kurz) would you like to check the machine "qa-power8" on https://powerhmc3.arch.suse.de/dashboard/#resources/systems yourself? I failed to find
(Michal Suchanek) The petitboot bootloader is Linux so if you boot into that it should have an option to enter shell, too
(Oliver Kurz) Yes, I used that to kexec, no problem if I would be able to reach that state. But for the machine "qa-power8" I am not even sure if network is connected. And for a second machine, power8.openqanet.opensuse.org, the thing is that according to https://www.ibm.com/docs/en/linux-on-systems?topic=qsglpsls-installing-ubuntu-power-system-lc-servers-using-network-boot we would need a monitor+keyboard at least for the initial setup after we lost remote IPMI after a firmware upgrade, but according to Nick Singer in https://progress.opensuse.org/issues/116078#note-28 the machine does not have any graphical output port
(Michal Suchanek) Isn't there an option to use a physical serial port?
(Oliver Kurz) I would need to ask someone to get into that server room Nbg SRV1 and check that
(Michal Suchanek) If you are unsure that network is connected you need to ask people who connect the network. In petitboot you should be able to configure network card and see what MAC address it has.
(Michal Suchanek) the option to use physical serial port would have to be configurable in the web management [ASM]

This turned out to be available under "System Service Aids", "Serial Port Setup", but only if the machine is not managed by an HMC.

#37 Updated by okurz 9 days ago

  • Tags set to next-office-day

Potentially we can move power8 out of NUE1 SRV1 and to Frankencampus where it might be easier to handle with non-restricted physical access

#38 Updated by okurz 3 days ago

  • Status changed from Blocked to Feedback
  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High

nicksinger, you wanted to wait for mmaher/gschlotter to tell you when they go to SRV1 and then try with a thumb drive and serial port, didn't you?
yes, it will be our only way to access the machine currently
