action #159048
Setup new Power10 machine for QE LSG in PRG2 (S/N 7882391) size:M
Description
(JIRA-SD: https://sd.suse.com/servicedesk/customer/portal/1/SD-154392)
According to @horon, two new Power10 machines have arrived in PRG2 (among others). One of these machines (S/N 7882391) is intended for use in QE LSG and will be integrated into openQA. Ideally this machine should be mounted into PRG2-J-J11.
(I am aware that mentioned rack is full at the moment, so one goal of the ticket would also be to find a solution for this.)
As for the requirements of the machine, the same apply as for our Power9 machines in this rack (e.g. redcurrant.oqa.prg2.suse.org).
This means:
AC1 : Machine is racked and connected to power and network in an openQA-assigned rack
AC2 : Machine is reachable over SSH from openqa.suse.de over FQDN (TBD)
AC3 : ASM of the machine is reachable from powerhmc1.oqa.suse.org
AC4 : Racktables is up-to-date with the network connection details
Updated by okurz 8 months ago
- Tags set to infra, reactive work
- Category set to Feature requests
- Status changed from New to Blocked
- Assignee set to okurz
- Target version set to Ready
Thx, will track https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
Updated by okurz 8 months ago · Edited
Following points discussed with mgriessmeier:
- Moving machines around in PRG2-J11 to pack the rack to 100% sounds like a lot of effort and would only be a time-limited solution, because sooner or later we will need to add or change something again, so I recommend against doing that.
- legolas might be moved because it’s not used in openqa.suse.de directly; the same applies to huckleberry, soapberry, blackcurrant and cloudberry. But that would mean we need access to the HMC powerhmc1.oqa.prg2.suse.org OR a separate HMC as requested in https://jira.suse.com/browse/ENGINFRA-3764 . If an HMC can be provided in PRG2e then I see no problem with moving those machines to PRG2e. haldir should not be moved as it’s currently used in openqa.suse.de.
- If access to powerhmc1.oqa.prg2.suse.org can be provided, or the separate HMC can be provided in PRG2e, then the easiest for now would be to place the new Power10 machine in PRG2e. However, that would be less consistent, as then more “production” oqa.prg2.suse.org machines would be in PRG2e while non-oqa.prg2.suse.org machines would be in PRG2.
- It would be preferred to have “production”-load in PRG2 meaning we need more space in PRG2. Generally we are flexible where in PRG2 that would be.
- You could also put machines into the neighboring rack J12, reserved for openqa.opensuse.org, but only if there are no further restrictions from IT. Keep in mind that machines in J12 are in a dedicated DMZ network, so they are not part of SUSE-internal networks. One could either just put machines in J12 and still connect them to the switch in J11, which is a bit messy, or configure the switch in J12 to serve multiple networks, which is also a bit messy and potentially insecure if one is not careful with the configuration.
- Maybe there is also other space in PRG2 that can be used and easier to connect to than PRG2e, e.g. J5 which is currently empty, or move the content from J10, the openSUSE rack, to have a neighboring rack for similar purpose
The problem of physical space is also related to #153736, which is about nessberry, which "should" go to J11 even though there is not enough free space there.
In #132140-4 I took notes regarding the original plan to put PowerPC machines in PRG2 for now
Meeting with jford, mcaj, horon, mgriessmeier, gpfuetzenreuter. We decided into which racks the PowerPC in PRG2 should go. There are two racks planned for LSG QE: PRG2-J11 for QE&openqa.suse.de, PRG2-J12 for openqa.opensuse.org […] no specific place was planned within the QE/openQA racks so far but we have some space available so we planned all those PowerPC machines to go to PRG2-J11 as well. If there is not enough space those machines can move to other spare racks based on consideration by Eng-Infra
Regarding that one should also consider that at this time, 2023-07-07, there were no plans regarding PRG2e yet and NUE3 was planned to be a viable hot redundant datacenter. After those plans changed it was decided to move more PowerPC machines to PRG2/PRG2e.
Updated by okurz 8 months ago
One more idea: we are currently using only 17 HEs in PRG2e in https://racktables.nue.suse.com/index.php?page=rack&rack_id=24562 ; adding an estimated 2 HE for the new Power10 makes 19 HEs. Maybe the easiest solution for all would be to move those 19 HEs to PRG2, if IT can provide the space.
Updated by livdywan 7 months ago
okurz wrote in #note-2:
Thx, will track https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
Looks like work on the HMC is being tracked in https://jira.suse.com/browse/ENGINFRA-4259 now.
Updated by okurz 6 months ago
https://jira.suse.com/browse/ENGINFRA-4259, https://jira.suse.com/browse/ENGINFRA-4265, https://jira.suse.com/browse/ENGINFRA-4268 were all resolved but I am not aware about any information for a new HMC instance. I reopened https://jira.suse.com/browse/ENGINFRA-4268, put a comment there as well as in https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
Updated by okurz 6 months ago
I overlooked an email from 2024-05-29 with credentials. Created https://gitlab.suse.de/openqa/password/-/merge_requests/14 accordingly for the second HMC.
Updated by okurz 6 months ago
I have now merged https://gitlab.suse.de/openqa/password/-/merge_requests/14 myself. gschlotter informed me that huckleberry will be moved tomorrow as decided in https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 . To replicate the latest status from https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 here:
Moroni Flores 5 days ago
Oliver's suggestion is sensible and can be actioned. We will move the machines listed from PRG2 to PRG2e and also mount the new Power10 machine with them. DCOps will decide in which rack they will be mounted.
Oliver Kurz 1 week ago
Discussed with mgriessmeier: based on suggestions by gschlotter and on the wish by IT to move all “QE but not openQA” machines to PRG2e, mgriessmeier and I suggest the following:
- We can move the following machines to PRG2e as they are “QE” and not “openQA”: haldir, legolas, huckleberry, soapberry, blackcurrant
- Not cloudberry, as that machine seems to be free, so we would like to repurpose it directly as an openQA production worker; it’s already in the right location.
- nessberry is already in PRG2e and should be able to stay there as we now have the separate HMC and we are moving other QE PowerPC machines next to it.
- The two new Power10 machines can then also stay or be mounted in PRG2e for now, next to haldir, legolas, huckleberry, soapberry, blackcurrant
The above suggestions hold as long as we can assume that we will still be able to interconnect those machines with openqa.suse.de aka. machines in the oqa.prg2.suse.org domain.
Updated by okurz 6 months ago
- Copied to action #162443: Setup of dedicated HMC within qe.prg2.suse.org size:S added
Updated by livdywan 6 months ago
okurz wrote in #note-10:
https://gitlab.suse.de/openqa/password/-/merge_requests/14 now merged myself. gschlotter informed me that tomorrow huckleberry will be moved as decided in https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
The server is racked in J11 and the HMCs are connected to the TORs, power is attached, and the server is powered off. https://racktables.nue.suse.com/index.php?page=object&object_id=28056
Once legolas is moved the drawer will be racked and attached to the server/activated.
Things are happening 😸
Updated by livdywan 5 months ago
[...] https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
If no other action is needed at the moment I would close the ticket, as racking and cabling are completed.
Okay to continue here then?
Updated by okurz 5 months ago
livdywan wrote in #note-13:
[...] https://sd.suse.com/servicedesk/customer/portal/1/SD-154392
If no other action is needed at the moment I would close the ticket, as racking and cabling are completed.
Okay to continue here then?
It's simple: if this ticket is blocked by https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 then it's unblocked as soon as https://sd.suse.com/servicedesk/customer/portal/1/SD-154392 is closed, which is not yet the case.
Updated by nicksinger 4 months ago
I found a new machine which identifies itself as "BMC-10.255.255.34" in the HMC. Unfortunately no luck with the credentials, so I assume this is just the wrong machine. Unlike the others it doesn't show a proper serial number by which I could identify the system. This might mean we need the higher HMC version mentioned by Michal (10R3, we have 10R1), but before upgrading the HMC I want to confirm with infra that the display of the machine now shows the correct IP.
Updated by nicksinger 4 months ago
Conrad fixed the VLAN configuration for the ASMI and I was able to discover it in the "HMC/ASMI network" as 10.255.255.34. Adding it directly to the HMC resulted in strange issues while connecting so I had to forward the web-port of the ASMI (443) to my local machine via ssh on the HMC. With this I was able to access the interface and was requested to change the password which is currently documented in https://gitlab.suse.de/openqa/password/-/merge_requests/17
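The port forward described above can be sketched as follows; the HMC hostname and the hscroot user are assumptions based on names mentioned in this thread, not verified values.

```shell
# Forward a local port to the ASMI's HTTPS port (443) through the HMC.
# Hostname, user and ASMI IP are taken from this thread / assumed.
ssh -L 8443:10.255.255.34:443 hscroot@powerhmc.qe.prg2.suse.org
# While the tunnel is up, the ASMI web UI is reachable on the local
# machine at https://localhost:8443
```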
We now need to upgrade our HMC from 10R1 to 10R3. As this seems to be a bigger upgrade, we cannot simply apply it over the usual web interface upgrade process. It is also recommended to do a proper backup before trying the upgrade which we requested in https://sd.suse.com/servicedesk/customer/portal/1/SD-164253 . After this is done we can follow https://www.ibm.com/support/pages/node/7074590 to conduct the upgrade.
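A hedged sketch of the alternate-disk-boot upgrade flow along the lines of the linked IBM page; the FTP server, user and image directory are placeholders, so the actual IBM instructions should be followed step by step.

```shell
# On the HMC (restricted shell), roughly per the IBM upgrade instructions:
saveupgdata -r disk                                   # save upgrade data to the HMC disk
getupgfiles -h <ftp-server> -u <user> -d <dir-with-V10R3-images>  # stage the network install images
chhmc -c altdiskboot -s enable --mode upgrade         # boot from the staged images on next reboot
hmcshutdown -r -t now                                 # reboot into the upgrade
```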
Updated by openqa_review 4 months ago
- Due date set to 2024-09-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Copied to action #166775: Support the move of haldir added
Updated by nicksinger 3 months ago · Edited
- Assignee deleted (nicksinger)
Currently looking into getting redcurrant stable again and removing duplicate ASM connections which cause major stability/reliability problems.
However, if somebody is motivated to look into the upgrade procedure I think this can be done in parallel. According to https://sd.suse.com/servicedesk/customer/portal/1/SD-164253 and by confirmation in a personal slack message we have a HMC backup at hand if anything goes wrong.
Updated by gpathak about 2 months ago
@okurz @nicksinger please add me to the SD Ticket, I don't have access to https://sd.suse.com/servicedesk/customer/portal/1/SD-164253
Updated by okurz about 2 months ago
The ticket is shared with "OSD Admins". If you are part of that group, which you should be as that is part of the onboarding, then you have access.
Updated by gpathak 19 days ago · Edited
Do we have any FTP server in the 10.145.14.x subnet? If not, which machine should be used to set up the FTP server? The HMC powerhmc.qe.prg2.suse.org cannot reach the internet, and in order to perform an upgrade IBM recommends staging the updates on a local server.
Updated by mgriessmeier 16 days ago · Edited
gpathak wrote in #note-30:
Do we have any ftp server in the 10.145.14.x subnet?
There is an IBM subfolder with IBM software at https://dist.suse.de/IBM/ which should be reachable from 10.145.14.x.
Maybe the correct file is already there? (https://dist.suse.de/IBM/HMC.power.10R3Update/ looks promising, no?)
Otherwise, you could ask SUSE IT to sync your files there.
Updated by openqa_review 16 days ago
- Due date set to 2024-12-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpathak 15 days ago
powerhmc1.oqa.prg2.suse.org already has the iFix for HMC V10R1 M1010 and the Service Pack MF71508 (HMC V10R1 M1023) applied, so we can proceed directly with an upgrade. However, it is required to save the upgrade data first using saveupgdata; IBM shows an example of saving the data on a local USB flash drive. saveupgdata also has options to store the data on an FTP server. Since dist.suse.de cannot be used to save the data, we need to set up a temporary FTP server in the same network to save the upgrade data.
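One possible way to provide such a temporary anonymous FTP server is vsftpd on an openSUSE host; this is a sketch under that assumption, not necessarily what was actually used, and package name and paths may differ.

```shell
# Minimal anonymous-upload vsftpd setup (assumed package name and paths).
sudo zypper -n install vsftpd
sudo tee /etc/vsftpd.conf >/dev/null <<'EOF'
listen=YES
anonymous_enable=YES
write_enable=YES
anon_upload_enable=YES
anon_root=/srv/ftp
EOF
sudo mkdir -p /srv/ftp/upload
sudo chown ftp /srv/ftp/upload
sudo systemctl restart vsftpd
```

With something like this in place, saveupgdata's FTP target can point at the host's upload/ directory.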
Updated by gpathak 15 days ago
Set up an FTP server on worker30.oqa.prg2.suse.org temporarily.
Saved the upgrade data using: saveupgdata -r diskftp -h worker30.oqa.prg2.suse.org -u anonymous --passwd abc -d upload/ --migrate -i netcfg
After that I performed the upgrade and rebooted the HMC.
The HMC is now up, but the systems are in "Failed Authentication" state.
Updated by gpathak 14 days ago · Edited
Thanks a lot to @nicksinger for helping me with resolving the system authentication issue.
- Removing and re-adding the systems by their internal (private to HMC) IP address resolved the issue in this case
- Set HMC as PowerVM management controller for grenache and redcurrant
chcomgmt -m grenache -o setcontroller -t norm
chcomgmt -m redcurrant -o setcontroller -t norm
The HMC is upgraded to V10R3
The new Power10's ASM is reachable from powerhmc1.oqa.suse.org, so AC3 is done!
Next we need to bring the machine into the network via FQDN and set it up for running tests.
Updated by gpathak 12 days ago
gpathak wrote in #note-36:
Thanks a lot to @nicksinger for helping me with resolving the system authentication issue.
- Removing and re-adding the systems by their internal (private to HMC) IP address resolved the issue in this case
- Set HMC as PowerVM management controller for grenache and redcurrant
chcomgmt -m grenache -o setcontroller -t norm
chcomgmt -m redcurrant -o setcontroller -t norm
The HMC is upgraded to V10R3
The new Power10's ASM is reachable from powerhmc1.oqa.suse.org, so AC3 is done!
Next we need to bring the machine into the network via FQDN and set it up for running tests.
Since we started getting issues in openQA tests for ppc, we have to release grenache from HMC master: chcomgmt -m grenache -o relmaster
The next step involves setting up a VIOS; internet connectivity is needed for that, so I have to sync with @nicksinger or @jbaier_cz.
Updated by gpathak 11 days ago
VIOS is installed on Power10 - paraffin.
Network configuration via mktcpip per https://gitlab.suse.de/hsehic/qa-css-docs/-/blob/master/infrastructure/power9-configuration.md#vios-server-installation is pending.
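The pending mktcpip step could look roughly like this on the VIOS restricted shell; every address, hostname and interface below is a placeholder, not a confirmed value for paraffin.

```shell
# Assign an IP to the VIOS network interface (placeholder values).
mktcpip -hostname paraffin-vios -inetaddr <vios-ip> -interface en0 \
    -netmask 255.255.255.0 -gateway <gateway-ip> -nsrvaddr <dns-ip>
# Verify the resulting configuration:
lstcpip -interfaces
```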
Updated by gpathak 9 days ago
- File clipboard-202412031028-i9ctj.png clipboard-202412031028-i9ctj.png added
- File clipboard-202412031029-jrtdy.png clipboard-202412031029-jrtdy.png added
- File clipboard-202412031032-s92rl.png clipboard-202412031032-s92rl.png added
- File clipboard-202412031032-w26p2.png clipboard-202412031032-w26p2.png added
- Status changed from In Progress to Workable
Getting an issue while trying to add a Logical Volume to the LPAR: the LPAR complains that the vhost associated with the ID isn't available, although I checked and verified that the vhosts are available.
Updated by gpathak 7 days ago
Since the Logical Volume wasn't getting attached to the VIOS, I decided to re-install the VIOS. After deleting the previous VIOS, the re-installation always gets stuck at
Wed Dec 4 13:14:41 2024
-----------/var/log/nimol.log :---------------
Wed Dec 4 13:21:36 2024 nimol: installios: led code=Starting kernel : ,info=Installing and configuring.
There are no logs from the AIX kernel and the installation doesn't proceed further.
Updated by gpathak 7 days ago
@okurz @nicksinger
I need some help with copying this VIOS image https://dist.suse.de/IBM/Power/VIOS_4.1.0.20/ to the HMC. The cpviosimg command can be used to copy the VIOS image either via NFS or SFTP. I downloaded the image to my home directory on worker30.oqa.prg2.suse.org, but when trying to copy the image to the HMC I get "The specified SFTP hostname is not valid."
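For reference, a hedged sketch of the cpviosimg invocation being attempted; the flags follow the HMC command usage as I understand it, and all host, path and name values are assumptions.

```shell
# Copy a downloaded VIOS image onto the HMC via SFTP (placeholder values).
cpviosimg -r sftp -h worker30.oqa.prg2.suse.org -u <user> --passwd <password> \
    -f /home/<user>/VIOS_4.1.0.20.iso -n VIOS_4.1.0.20
# If "The specified SFTP hostname is not valid" persists, trying the worker's
# IP address instead of the FQDN would rule out name resolution on the HMC.
```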
Updated by okurz 5 days ago · Edited
- Possible alternative approaches to proceed:
- Collect all current problems, state them together and escalate that we can't proceed without help
- Organize a joint working session with PowerPC experts outside the team
- Only continue the work after multiple people from the team can join together
- Do more generic PowerPC training, e.g. ask experts from SUSE or also IBM -> #173863
- DONE Change the assignee of the ticket for change of perspective
- Consider squad rotation e.g. to UV squad who also manually administrate and work with PowerPC
Updated by okurz 5 days ago
- Copied to action #173863: Do more generic PowerPC training, e.g. ask experts from SUSE or also IBM added
Updated by okurz 5 days ago
- Copied to action #173866: Share the learnings from Power10 setup e.g. in blog post or workshop added
Updated by okurz about 12 hours ago
- Due date changed from 2024-12-10 to 2024-12-17
Primary goal: have a plan going forward in new ticket(s) so that the machine is usable and finally used in OSD. Secondary goal: please take a look into the current state; if it's feasible, get the VIOS and at least one LPAR running.
Updated by nicksinger about 12 hours ago
- Status changed from Workable to In Progress