action #159048

open

Setup new Power10 machine for QE LSG in PRG2 (S/N 7882391) size:M

Added by mgriessmeier 8 months ago. Updated about 12 hours ago.

Status:
In Progress
Priority:
Normal
Assignee:
nicksinger
Category:
Feature requests
Start date:
2024-04-16
Due date:
2024-12-17 (Due in 6 days)
% Done:

0%

Estimated time:

Description

(JIRA-SD: https://sd.suse.com/servicedesk/customer/portal/1/SD-154392)
According to @horon, two new Power10 machines have arrived in PRG2 (among others). One of these machines (S/N 7882391) is intended for use in QE LSG and will be integrated into openQA. Ideally this machine should be mounted into PRG2-J-J11.

(I am aware that the mentioned rack is full at the moment, so one goal of this ticket would also be to find a solution for this.)

As for the requirements of the machine, the same apply as for our P9 machines in this rack (e.g. redcurrant.oqa.prg2.suse.org).

This means:

AC1: Machine is racked and connected to power and network in an openQA-assigned rack
AC2: Machine is reachable over SSH from openqa.suse.de over FQDN (TBD)
AC3: ASM of the machine is reachable from powerhmc1.oqa.suse.org
AC4: Racktables is up-to-date with the network connection details


Related issues 4 (2 open, 2 closed)

Copied to openQA Infrastructure (public) - action #162443: Setup of dedicated HMC within qe.prg2.suse.org size:S · Resolved · okurz
Copied to openQA Infrastructure (public) - action #166775: Support the move of haldir · Resolved · okurz · 2024-04-16
Copied to openQA Infrastructure (public) - action #173863: Do more generic PowerPC training, e.g. ask experts from SUSE or also IBM · New
Copied to openQA Infrastructure (public) - action #173866: Share the learnings from Power10 setup e.g. in blog post or workshop · New

Actions #1

Updated by mgriessmeier 8 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 8 months ago

  • Tags set to infra, reactive work
  • Category set to Feature requests
  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version set to Ready
Actions #3

Updated by okurz 8 months ago · Edited

Following points discussed with mgriessmeier:

  1. Moving machines around in PRG2-J11 to pack the rack to 100% sounds like a lot of effort and would only be a time-limited solution because sooner or later we will need to add or change something again, so I recommend against doing that
  2. legolas might be moved because it’s not used in openqa.suse.de directly, same as huckleberry, soapberry, blackcurrant, cloudberry, but that would mean we need access to the HMC powerhmc1.oqa.prg2.suse.org OR a separate HMC as requested in https://jira.suse.com/browse/ENGINFRA-3764 . If an HMC can be provided in PRG2e then I see no problem with moving those machines to PRG2e. haldir should not be moved as it’s currently used in openqa.suse.de
  3. If access to powerhmc1.oqa.prg2.suse.org or the separate HMC can be provided in PRG2e, then the easiest option for now would be to place the new Power10 machine in PRG2e. However, that would be less consistent, as then more “production” oqa.prg2.suse.org machines would be in PRG2e whereas non-oqa.prg2.suse.org machines would be in PRG2
  4. It would be preferred to have “production” load in PRG2, meaning we need more space in PRG2. Generally we are flexible about where in PRG2 that would be.
  5. We could also put machines into the neighboring rack J12, reserved for openqa.opensuse.org, but only if there are no further restrictions from IT. Keep in mind that machines in J12 are in a dedicated DMZ network, so not part of SUSE-internal networks. One could either just put machines in J12 and still connect to the switch in J11 (a bit messy), or configure the switch in J12 to serve multiple networks (also a bit messy and potentially insecure if one is not careful with the configuration)
  6. Maybe there is also other space in PRG2 that can be used and is easier to connect to than PRG2e, e.g. J5 which is currently empty, or we could move the contents of J10, the openSUSE rack, to gain a neighboring rack for a similar purpose

The problem of physical space is also related to #153736, which is about nessberry that "should" go to J11, which however does not have enough free space.

In #132140-4 I took notes regarding the original plan to put PowerPC machines in PRG2 for now:

Meeting with jford, mcaj, horon, mgriessmeier, gpfuetzenreuter. We decided into which racks the PowerPC in PRG2 should go. There are two racks planned for LSG QE: PRG2-J11 for QE&openqa.suse.de, PRG2-J12 for openqa.opensuse.org […] no specific place was planned within the QE/openQA racks so far but we have some space available so we planned all those PowerPC machines to go to PRG2-J11 as well. If there is not enough space those machines can move to other spare racks based on consideration by Eng-Infra

Regarding that, one should also consider that at the time, 2023-07-07, there were no plans regarding PRG2e yet and NUE3 was still planned to be a viable hot-redundant datacenter. After those plans changed it was decided to move more PowerPC machines to PRG2/PRG2e.

Actions #4

Updated by okurz 8 months ago

  • Parent task set to #159543
Actions #6

Updated by okurz 8 months ago

One more idea: We are currently using only 17 HEs in PRG2e in https://racktables.nue.suse.com/index.php?page=rack&rack_id=24562 plus an estimated 2 HE for the new Power10, which makes 19 HEs in total. Maybe the easiest solution for all would be to move those 19 HEs to PRG2 if IT can provide the space.

Actions #7

Updated by livdywan 7 months ago

okurz wrote in #note-2:

Thx, will track https://sd.suse.com/servicedesk/customer/portal/1/SD-154392

Looks like work on the HMC is being tracked in https://jira.suse.com/browse/ENGINFRA-4259 now.

Actions #9

Updated by okurz 6 months ago

I overlooked an email from 2024-05-29 with credentials. Created
https://gitlab.suse.de/openqa/password/-/merge_requests/14
accordingly for the second HMC.

Actions #10

Updated by okurz 6 months ago

I have now merged https://gitlab.suse.de/openqa/password/-/merge_requests/14 myself. gschlotter informed me that huckleberry will be moved tomorrow as decided in https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 . To replicate the latest status from https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 here:

Moroni Flores 5 days ago
Oliver's suggestion is sensible and can be actioned. We will move the machines listed from PRG2 to PRG2e and mount also the new power10 machine with them. DCOps will decide in which rack they will be mounted.

Oliver Kurz 1 week ago
Discussed with mgriessmeier: based on suggestions by gschlotter and based on the wish by IT to move all “QE but not openQA” machines to PRG2e, mgriessmeier and I suggest the following:

  1. We can move the following machines to PRG2e as they are “QE” and not “openQA”: haldir, legolas, huckleberry, soapberry, blackcurrant
  2. Not cloudberry, as that machine seems to be free, so we would like to repurpose it directly as an openQA production worker; it’s already in the right location.
  3. nessberry is already in PRG2e and should be able to stay there as we now have the separate HMC and we are moving other QE PowerPC machines next to it.
  4. The two new Power10 machines can then also stay in or be mounted in PRG2e for now, next to haldir, legolas, huckleberry, soapberry, blackcurrant

The above suggestions hold as long as we can assume that we will still be able to interconnect those machines with openqa.suse.de, i.e. machines in the oqa.prg2.suse.org domain.

Actions #11

Updated by okurz 6 months ago

  • Copied to action #162443: Setup of dedicated HMC within qe.prg2.suse.org size:S added
Actions #12

Updated by livdywan 6 months ago

okurz wrote in #note-10:

I have now merged https://gitlab.suse.de/openqa/password/-/merge_requests/14 myself. gschlotter informed me that huckleberry will be moved tomorrow as decided in https://sd.suse.com/servicedesk/customer/portal/1/SD-154392

The server is racked in J11 and the HMCs are connected to the ToRs, power is attached, and the server is powered off. https://racktables.nue.suse.com/index.php?page=object&object_id=28056
Once legolas is moved, the drawer will be racked and attached to the server/activated.

Things are happening 😸

Actions #13

Updated by livdywan 5 months ago

[...] https://sd.suse.com/servicedesk/customer/portal/1/SD-154392

If no other action is needed at the moment I would close the ticket, as racking and cabling are completed.

Okay to continue here then?

Actions #14

Updated by okurz 5 months ago

livdywan wrote in #note-13:

[...] https://sd.suse.com/servicedesk/customer/portal/1/SD-154392

If no other action is needed at the moment I would close the ticket, as racking and cabling are completed.

Okay to continue here then?

It's simple: if this ticket is blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 then it's unblocked as soon as https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 is closed, which is not yet the case.

Actions #15

Updated by nicksinger 4 months ago

  • Assignee changed from okurz to nicksinger

taking over from here

Actions #16

Updated by nicksinger 4 months ago

I found a new machine which identifies itself as "BMC-10.255.255.34" in the HMC. Unfortunately no luck with the credentials, so I assume this is just the wrong machine. Unlike the others it doesn't show a proper serial number by which I could identify the system. This might mean we need the higher HMC version mentioned by Michal (10R3, we have 10R1), but before upgrading the HMC I want to confirm with infra whether the display of the machine now shows the correct IP.

Actions #17

Updated by nicksinger 4 months ago

Conrad fixed the VLAN configuration for the ASMI and I was able to discover it in the "HMC/ASMI network" as 10.255.255.34. Adding it directly to the HMC resulted in strange issues while connecting, so I had to forward the web port of the ASMI (443) to my local machine via ssh on the HMC. With this I was able to access the interface and was asked to change the password, which is currently documented in https://gitlab.suse.de/openqa/password/-/merge_requests/17
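For reference, such a forward can be set up with plain OpenSSH from a workstation; the local port 8443 is an arbitrary choice and the ASMI IP is the one discovered above:

    # forward a local port through the HMC to the ASMI web interface (port 443)
    ssh -L 8443:10.255.255.34:443 hscroot@powerhmc1.oqa.prg2.suse.org
    # then browse to https://localhost:8443 on the local machine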

We now need to upgrade our HMC from 10R1 to 10R3. As this is a bigger upgrade, we cannot simply apply it over the usual web-interface update process. It is also recommended to take a proper backup before attempting the upgrade, which we requested in https://sd.suse.com/servicedesk/customer/portal/1/SD-164253 . After this is done we can follow https://www.ibm.com/support/pages/node/7074590 to conduct the upgrade.
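For orientation, the network-based upgrade flow on the HMC CLI typically looks like the following sketch; server, credentials and directory are placeholders, and the linked IBM page remains the authoritative procedure:

    # save the upgrade data (profiles, network config) before the upgrade
    saveupgdata -r disk
    # fetch the V10R3 network install images from a server in reach of the HMC
    getupgfiles -h <server> -u <user> --passwd <password> -d <remote-directory>
    # boot from the alternate disk partition in upgrade mode on the next restart
    chhmc -c altdiskboot -s enable --mode upgrade
    # reboot into the upgrade
    hmcshutdown -r -t now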

Actions #18

Updated by nicksinger 4 months ago

  • Status changed from Blocked to In Progress
Actions #19

Updated by mkittler 4 months ago

  • Subject changed from Setup new Power10 machine for QE LSG in PRG2 (S/N 7882391) to Setup new Power10 machine for QE LSG in PRG2 (S/N 7882391) size:M
  • Description updated (diff)
Actions #20

Updated by openqa_review 4 months ago

  • Due date set to 2024-09-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by livdywan 3 months ago

  • Due date deleted (2024-09-11)

I'm guessing this is not in progress right now in favor of more urgent tasks

Actions #22

Updated by livdywan 3 months ago

  • Status changed from In Progress to Workable
Actions #23

Updated by okurz 3 months ago

Actions #24

Updated by nicksinger 3 months ago · Edited

  • Assignee deleted (nicksinger)

Currently looking into getting redcurrant stable again and removing duplicate ASM connections which cause major stability/reliability problems.

However, if somebody is motivated to look into the upgrade procedure, I think this can be done in parallel. According to https://sd.suse.com/servicedesk/customer/portal/1/SD-164253 and by confirmation in a personal Slack message, we have an HMC backup at hand if anything goes wrong.

Actions #25

Updated by okurz 3 months ago

  • Priority changed from Normal to Low
Actions #26

Updated by gpathak about 2 months ago

  • Assignee set to gpathak
Actions #27

Updated by gpathak about 2 months ago

@okurz @nicksinger please add me to the SD Ticket, I don't have access to https://sd.suse.com/servicedesk/customer/portal/1/SD-164253

Actions #28

Updated by okurz about 2 months ago

The ticket is shared with "OSD Admins". If you are part of that group (which you should be, as that is part of the onboarding) then you have access.

Actions #29

Updated by okurz 21 days ago

  • Priority changed from Low to Normal
Actions #30

Updated by gpathak 19 days ago · Edited

Do we have any FTP server in the 10.145.14.x subnet?
If not, which machine should be used to set up the FTP server?

The HMC powerhmc.qe.prg2.suse.org cannot reach the internet.
To perform an upgrade, IBM recommends staging the updates on a local server.

Actions #31

Updated by mgriessmeier 16 days ago · Edited

gpathak wrote in #note-30:

Do we have any FTP server in the 10.145.14.x subnet?

There is an IBM subfolder with IBM software in https://dist.suse.de/IBM/ which should be reachable from 10.145.14.x.
Maybe the correct file is already there? (https://dist.suse.de/IBM/HMC.power.10R3Update/ looks promising, no?)
Otherwise, you could ask SUSE IT to sync your files there.

Actions #32

Updated by gpathak 16 days ago

  • Status changed from Workable to In Progress
Actions #33

Updated by openqa_review 16 days ago

  • Due date set to 2024-12-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #34

Updated by gpathak 15 days ago

powerhmc1.oqa.prg2.suse.org already has the iFix for HMC V10R1 M1010 and Service Pack MF71508 (HMC V10R1 M1023) applied.
So we can proceed directly with the upgrade, but it is required to save the upgrade data using saveupgdata first; IBM shows an example of saving the data on a local USB flash drive.
saveupgdata also has options to store the data on an FTP server. Since dist.suse.de cannot be used to save the data, we need to set up a temporary FTP server in the same network to save the upgrade data.
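One lightweight option for such a throwaway server, assuming a Python 3 host in the same subnet (package name and paths are illustrative, not necessarily what was actually used):

    # serve a writable directory for anonymous FTP users on port 21 (needs root)
    pip install pyftpdlib
    python3 -m pyftpdlib --port 21 --write --directory /srv/ftp/upload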

Actions #35

Updated by gpathak 15 days ago

Set up an FTP server on worker30.oqa.prg2.suse.org temporarily.
Saved the upgrade data using saveupgdata -r diskftp -h worker30.oqa.prg2.suse.org -u anonymous --passwd abc -d upload/ --migrate -i netcfg
After that, performed the upgrade and rebooted the HMC.
The HMC is now up, but the systems are in "Failed Authentication" state.

Actions #36

Updated by gpathak 14 days ago · Edited

Thanks a lot to @nicksinger for helping me with resolving the system authentication issue.

  • Removing and re-adding the systems by their internal (private to the HMC) IP address resolved the issue in this case (see the sketch below)
  • Set HMC as PowerVM management controller for grenache and redcurrant
    chcomgmt -m grenache -o setcontroller -t norm
    chcomgmt -m redcurrant -o setcontroller -t norm
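
For the record, removing and re-adding a managed system by IP on the HMC CLI looks roughly like this; the IP is the one from this ticket, and the exact option spelling should be checked against the local HMC help:

    # drop the stale connection to the managed system
    rmsysconn -o remove --ip 10.255.255.34
    # re-add it by its service processor IP; supply the HMC Access password when prompted
    mksysconn --ip 10.255.255.34 -r sys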

The HMC is upgraded to V10R3.

The new Power10 ASM is reachable from powerhmc1.oqa.suse.org.
AC3 is done!
We still need to bring the machine into the network via FQDN and set it up for running tests.

Actions #37

Updated by gpathak 12 days ago

gpathak wrote in #note-36:

Thanks a lot to @nicksinger for helping me with resolving the system authentication issue.

  • Removing and re-adding the systems by their internal (private to the HMC) IP address resolved the issue in this case
  • Set HMC as PowerVM management controller for grenache and redcurrant
    chcomgmt -m grenache -o setcontroller -t norm
    chcomgmt -m redcurrant -o setcontroller -t norm

The HMC is upgraded to V10R3.

The new Power10 ASM is reachable from powerhmc1.oqa.suse.org.
AC3 is done!
We still need to bring the machine into the network via FQDN and set it up for running tests.

Since we started getting issues in openQA tests for ppc, we have to release grenache from HMC master: chcomgmt -m grenache -o relmaster

The next step involves setting up a VIOS; internet connectivity is needed for that, so I have to sync with @nicksinger or @jbaier_cz

Actions #38

Updated by gpathak 11 days ago

VIOS is installed on Power10 - paraffin.

Network configuration via mktcpip per https://gitlab.suse.de/hsehic/qa-css-docs/-/blob/master/infrastructure/power9-configuration.md#vios-server-installation is pending.
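
Per that guide, the network setup boils down to a single mktcpip call from the VIOS padmin shell; the addresses below are placeholders:

    # configure hostname, IP, netmask and gateway on the first virtual interface
    mktcpip -hostname paraffin -inetaddr <ip> -interface en0 -netmask <netmask> -gateway <gateway> -start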

Actions #39

Updated by gpathak 9 days ago

Getting an issue while trying to add a Logical Volume to the LPAR: the LPAR complains that the vhost associated with the ID isn't available.

Checked and verified that the vhosts are available.
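
Such a check can be done from the VIOS padmin shell, for example:

    # list virtual devices; vhost* are the virtual SCSI server adapters
    lsdev -virtual
    # show which backing devices are mapped to which vhost adapter
    lsmap -all
    # or inspect a single adapter (vhost0 as an example)
    lsmap -vadapter vhost0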

Actions #40

Updated by gpathak 7 days ago

  • Status changed from Workable to In Progress
Actions #41

Updated by gpathak 7 days ago

Since the Logical Volume wasn't getting attached to the VIOS, I decided to re-install the VIOS. After deleting the previous VIOS, the re-installation always gets stuck at

Wed Dec  4 13:14:41 2024
-----------/var/log/nimol.log :---------------
Wed Dec  4 13:21:36 2024  nimol: installios: led code=Starting kernel : ,info=Installing and configuring.

There are no logs from the AIX kernel and the installation doesn't proceed further.

Actions #42

Updated by gpathak 7 days ago

@okurz @nicksinger
I need some help with copying this VIOS image https://dist.suse.de/IBM/Power/VIOS_4.1.0.20/ to the HMC. The cpviosimg command can be used to copy the VIOS image either via NFS or SFTP.
I downloaded the image to my home directory on worker30.oqa.prg2.suse.org, but when trying to copy the image to the HMC I get "The specified SFTP hostname is not valid".
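For context, an invocation might look roughly like the following; the flags are assumed from the pattern of other HMC file-transfer commands such as getupgfiles and should be verified with the HMC's built-in help, and all values are placeholders:

    # hypothetical sketch: copy a VIOS installation image from an SFTP host
    # into the HMC's VIOS image repository
    cpviosimg -r sftp -h worker30.oqa.prg2.suse.org -u <user> -d <directory> -f "<image>.iso" -n VIOS_4.1.0.20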

Actions #43

Updated by gpathak 6 days ago

  • Assignee changed from gpathak to nicksinger
Actions #44

Updated by okurz 5 days ago

  • Status changed from In Progress to Workable
Actions #45

Updated by okurz 5 days ago · Edited

  • Possible alternative approaches to proceed:
    1. Collect all current problems, state them together and escalate that we can't proceed without help
    2. Organize a joint working session with PowerPC experts outside the team
    3. Only continue the work after multiple people from the team can join together
    4. Do more generic PowerPC training, e.g. ask experts from SUSE or also IBM -> #173863
    5. DONE Change the assignee of the ticket for change of perspective
    6. Consider squad rotation e.g. to UV squad who also manually administrate and work with PowerPC
Actions #46

Updated by okurz 5 days ago

  • Copied to action #173863: Do more generic PowerPC training, e.g. ask experts from SUSE or also IBM added
Actions #47

Updated by okurz 5 days ago

  • Copied to action #173866: Share the learnings from Power10 setup e.g. in blog post or workshop added
Actions #48

Updated by okurz about 12 hours ago

  • Due date changed from 2024-12-10 to 2024-12-17

Primary goal: Have a plan going forward in new ticket(s) for making the machine usable and finally used in OSD. Secondary goal: Please take a look into the current state. If it's feasible, get the VIOS and at least one LPAR running.

Actions #49

Updated by nicksinger about 12 hours ago

  • Status changed from Workable to In Progress
