action #139199
closed coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M
Description
Motivation
Most PowerPC machines are being set up in PRG2 within #132140 and are at least discoverable from the HMC. Now we can set up redcurrant as a production openQA PowerVM worker on OSD again.
Acceptance criteria
- AC1: redcurrant openQA instances as referenced in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls are able to pass openQA jobs after the move to PRG2
Suggestions
- Read #132140 about the generic setup and in particular the HMC
- See current infrastructure management network entry https://racktables.nue.suse.com/index.php?page=object&object_id=11354 and child objects about the machine "redcurrant"
- Ensure we have access to redcurrant both manually and via verification openQA jobs on osd
- Update the infrastructure management network entry, e.g. new FQDN
- Inform users about the result
- Crosscheck according alert silences on https://monitor.qa.suse.de/alerting/silences
Updated by okurz 6 months ago
- Copied from action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 added
Updated by okurz 6 months ago
- Related to action #132140: Support move of PowerPC machines to PRG2 size:M added
Updated by nicksinger 2 months ago
Checking if something changed in the current setup. I tried to start an LPAR called "redcurrant-1" and monitored its boot process by connecting to the vterm of the machine via the HMC:
$ ssh nsinger@10.145.14.33
$ mkvterm -m redcurrant -p redcurrant-1
-> the system complained about "No OS found". Now checking if the VIOS comes up properly by replacing redcurrant-1 with vios. We have 2 other VIOS installed, so maybe just the wrong one is booted. Currently it hangs at:
-------------------------------------------------------------------------------
Welcome to the Virtual I/O Server.
boot image timestamp: 12:22:52 11/18/2019
The current time and date: 13:26:26 02/14/2024
processor count: 1; memory size: 5120MB; kernel size: 45241275
boot device: /pci@800000020000015/pci1014,034A@0/sas/disk@160763df700
-------------------------------------------------------------------------------
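Since two other VIOS instances exist and possibly the wrong one is booted, the HMC CLI can show which LPAR profiles the managed system has and which are actually running. A minimal sketch: `lssyscfg -r lpar -m redcurrant -F name,state` is the standard HMC command, but the exact names and output below are sample data (not taken from this machine) so the filter logic can be checked locally:

```shell
# On the HMC (same host as the mkvterm session above), LPARs and their
# states can be listed with e.g.:
#   lssyscfg -r lpar -m redcurrant -F name,state
# The awk filter picks the entries in "Running" state from such
# comma-separated output; sample data stands in for the real listing.
lssyscfg_output='redcurrant-1,Not Activated
vios,Running'
printf '%s\n' "$lssyscfg_output" | awk -F, '$2 == "Running" {print $1}'
```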
Updated by okurz 2 months ago
- Status changed from Blocked to In Progress
Added redcurrant workerconf back with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/715
Updated by okurz 2 months ago · Edited
- Priority changed from Low to High
- Target version changed from future to Ready
Created test verification jobs
for i in {1..8}; do openqa-clone-job --within-instance https://openqa.suse.de/tests/13492073 WORKER_CLASS=redcurrant-$i TEST+=-redcurrant-$i-poo139199 _GROUP=0 BUILD= ;done
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13524999
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525000
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525001
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525002
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525003
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525004
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525005
- sle-15-SP6-Online-ppc64le-Build54.1-default@ppc64le-hmc -> https://openqa.suse.de/tests/13525006
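The jobs above came from the one-liner in the previous comment; the same loop in a more readable form (a sketch only: `echo` is prefixed so the commands are printed, not executed, and `openqa-clone-job` is not required to run it):

```shell
# Same parameters as the one-liner above; echo-prefixed so nothing is
# actually cloned when running this sketch.
for i in $(seq 1 8); do
  echo openqa-clone-job --within-instance https://openqa.suse.de/tests/13492073 \
    "WORKER_CLASS=redcurrant-$i" "TEST+=-redcurrant-$i-poo139199" _GROUP=0 BUILD=
done
```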
Updated by okurz 2 months ago
https://openqa.suse.de/tests/13524999 … https://openqa.suse.de/tests/13525006 look promising. It seems we have network but no BOOTP. The likely next step is to add the appropriate options to the host config in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml?ref_type=heads#L297 and following to boot grub over the network. @Gerhard Schlotter @Ignacio Torres does suttner1/2 already feature network PXE support?
Updated by okurz about 2 months ago
- Status changed from Blocked to In Progress
- Assignee changed from okurz to nicksinger
- Priority changed from Normal to High
#155521 resolved. Now the VIOS for redcurrant needs to be fixed as mentioned in #139199-8, #155521-24 and https://jira.suse.com/browse/ENGINFRA-3751
Updated by openqa_review about 2 months ago
- Due date set to 2024-03-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 months ago
As the VIOS installation still fails, the next steps are:
- Ask in #discuss-powerpc-architecture who can help understand why the installation fails to access the network in #155521-18
- Check again whether we actually see DHCP (maybe also BOOTP specifically) requests from the redcurrant VIOS on our oqa.prg2.suse.org DHCP servers suttner1+2
- Ask for firewall access to crosscheck whether the requirements from https://www.ibm.com/support/pages/ports-and-protocols-be-allowed-firewall-installios are fulfilled or whether related traffic is blocked; if IT denies access, ask them to fix the VIOS installation on redcurrant themselves and give them all credentials they need
- Because so far people only seem to get a VIOS installation done when the HMC is in the same network as the system ethernet, ask IT to connect the redcurrant system ethernet to the same network as the HMC, reconduct the installation there, and reconfigure the system ethernet afterwards if successful
- Ask IBM partner contacts for help
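For the DHCP/BOOTP check, a capture along these lines could be run on suttner1/suttner2. A sketch only: the interface name `eth0` and the MAC address are placeholders, not taken from the ticket, and the block just prints the command it would run:

```shell
# Build a tcpdump invocation watching for BOOTP/DHCP traffic from a
# given client MAC. Substitute the servers' real interface for "eth0"
# and the redcurrant VIOS MAC for the placeholder below.
mac="00:00:00:00:00:00"
filter="ether host $mac and (port 67 or port 68)"
echo "tcpdump -ni eth0 '$filter'"
```

Dropping the `ether host` clause shows all BOOTP/DHCP traffic, which is useful first to confirm the servers see any requests at all.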
Updated by nicksinger about 2 months ago
While collecting data for the first point and trying out some more things I learned from e.g. https://progress.opensuse.org/issues/139112, I realized that redcurrant and blackcurrant constantly reconnect to the HMC. This renders the machines almost unusable. I tried to reset the management connection, re-add the systems, reboot the physical systems and reboot the HMC, but nothing helped. Reached out to #dct-migration asking if they can check whether the switchports are flapping.
Updated by nicksinger about 1 month ago
I managed to get redcurrant into a working state again by removing all HMC connections in the ASM first, noting down one of the ASM's IPs (10.255.255.190 in this case) and removing the machine from the HMC web interface itself.
After manually adding only that one, single IP the system connected back to the HMC and didn't "flicker" any longer. After that I could conduct another try of installios, which I tracked here: https://paste.opensuse.org/pastes/367d26f0b2b2. I used that to ask in #discuss-powerpc-architecture for further ideas on how to debug: https://suse.slack.com/archives/C04K6388YUX/p1710765088843389
I then quickly crosschecked my installation method by letting Gerhard move all interfaces of redcurrant temporarily into the HMC network. After adjusting the IPs slightly, the HMC was able to conduct a successful VIOS installation on redcurrant. Gerhard then moved the interfaces back into the oqa.prg2 network. Currently LPARs on grenache are not able to reach the network, so we might now need to reconfigure the internal networking of redcurrant.
I also asked Lazaros to crosscheck our requirements from https://www.ibm.com/support/pages/ports-and-protocols-be-allowed-firewall-installios in #dct-migration: https://suse.slack.com/archives/C04MDKHQE20/p1710774211903239
Updated by nicksinger about 1 month ago
LPARs are currently still without network. I struggle to replicate the same setup as on grenache because grenache seems to be "co-managed" only, which reduces the options in the HMC.
The LPARs are already connected to the "VLAN2-ETHERNET0" virtual network but fail to reach anything through it. I assume we are missing the actual connection between the real world and this virtual network. Trying to bridge it didn't work because the HMC complains about an interface being in "DHCP Mode" (I assume it means the one in the VIOS). I will try a static configuration and see if I can configure it this way.
Updated by okurz about 1 month ago
- Subject changed from Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 to Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M
Updated by nicksinger about 1 month ago
VIOS DHCP disabled by logging onto the VIOS via mkvterm on the HMC and switching into the root shell with oem_setup_env:
- First, disable dhcpcd with: stopsrc -s dhcpcd
- Then de-configure all interfaces visible in ifconfig -a (except "lo") with: ifconfig en3 down and ifconfig en3 detach
- After each interface is down, verify and stop the TCP/IP stack with: rmtcpip
- Then reconfigure the last (physical) interface of redcurrant (-T4 device suffix, en3/ent3 in VIOS) with:
# mktcpip -h redcurrant-vios -a 10.145.10.237 -i en3 -n 10.144.53.53 -d oqa.prg2.suse.org -m 255.255.255.0 -g 10.145.10.254
# mktcpip
Usage: mktcpip {-S interface | -h hostname -a address -i interface
[-n nameserver_address -d domain] [-m subnet_mask]
[-g gateway_address]] [-t cable_type] [-r ring_speed]
[-c subchannel -D destination_address] [-s]}
-h hostname Hostname of your machine
-a address IP address of the interface specified by -i
(must be in dotted decimal notation)
-i interface Interface to associate with the -a IP address
-n nameserver_address IP address of nameserver machine
(must be in dotted decimal notation)
-d domain Domain name, only use with -n
-m subnet_mask Subnetwork mask (dotted decimal or 0x notation)
-g gateway_address Gateway destination address
(dotted decimal or symbolic name)
-t cable_type Cable type for ethernet, either 'bnc', 'dix', 'tp'
or 'N/A'
-r ring_speed Ring speed for token ring adapter, either 4 or 16
-s Start the TCP/IP daemon
-c subchannel Subchannel Address for 370 Channel Adapter
-D destination_address Destination IP address for 370 Channel Adapter
-S interface Retrieve information for SMIT display
Example: /usr/sbin/mktcpip -h fred.austin.ibm.com -a 192.9.200.9 -i tr0
The RMC connection to the HMC is critical to follow up with the LPAR network configuration, so ensure it works (how to reset/restart it without a reboot: https://www.ibm.com/support/pages/how-resetrestart-rmc-server-vios).
Back in the HMC, in the view of the managed machine (redcurrant), all essential configuration can be found under "PowerVM -> Virtual Networks". I created a new "Virtual Network" and followed the wizard, which created a new "Virtual Switch" accordingly. This network was then bridged to a physical adapter port (-T3 suffix, second port on the 4-port-adapter of grenache). Every LPAR then needs to be connected to the newly created virtual network.
Selecting the LPAR view in the HMC, one can add a new connection under "Virtual I/O -> Virtual Networks -> Attach Virtual Network". An according LPAR adapter is created which can be viewed on the same page as the networks by clicking the little gray arrow at the top. Selecting the new adapter under "Action -> View Virtual Ethernet Adapter Settings", the new MAC of that interface can be seen. As it is bridged with the physical adapter, it will be visible in our infrastructure. I created this for all LPARs and adjusted our config accordingly: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4938
Things to still check:
- will the VIOS use DHCP again after a reboot? How to check?
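One hedged way to answer the open question, assuming AIX's System Resource Controller tooling is available from oem_setup_env (not verified on this VIOS): after a reboot, `lssrc -s dhcpcd` shows whether dhcpcd came back. A sketch, fed with sample SRC output so the parse is checkable locally:

```shell
# After a VIOS reboot, run inside oem_setup_env:
#   lssrc -s dhcpcd
# and confirm the Status column reads "inoperative". The parse below
# uses a sample of the usual SRC output format.
sample='Subsystem         Group            PID          Status
 dhcpcd           tcpip                         inoperative'
printf '%s\n' "$sample" | awk 'NR == 2 {print $NF}'
```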
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
Next step after the MR is merged: Test if the LPARs can network boot and enable them again in openQA
Updated by okurz about 1 month ago
- Status changed from Feedback to In Progress
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4938 deployed, nicksinger is validating
Updated by okurz about 1 month ago
- Copied to action #157777: Provide more consistent PowerPC openQA ressources by migrating all novalink instances to hmc size:M added
Updated by nicksinger about 1 month ago
We have a first openQA validation run: https://openqa.suse.de/tests/13845946
It shows that we also lost the disk configuration due to the VIOS re-installation. I used the following commands to create a new "Storage Pool", as the HMC had problems creating one because the physical disks contained fragments of the old installation/assignment:
ssh padmin@redcurrant-vios4.oqa.prg2.suse.org
$ lspv
NAME PVID VG STATUS
hdisk0 0008201a64fa66f6 None
hdisk1 0008201af8f5b71e rootvg active
hdisk2 0008201a886cd4bd lparvg active
$ mksp -f lparvg hdisk2
Afterwards, in the machine view -> Virtual Storage -> Select redcurrant-vios -> Action -> Manage Virtual Storage, one can create Virtual Disks. Disks created here cannot be assigned to LPARs directly, but apparently one (unassigned) disk must exist for the LPAR to see the Storage Pool, so I created a "redcurrant-1" disk here to assign later. I opened the LPAR view for redcurrant-2 and up and followed "Virtual I/O -> Virtual Storage -> Logical Volume -> Add Logical Volume" to get a wizard which allows creating more Virtual Disks in the existing pool for LPARs.
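The lspv listing above shows which disks still belong to no volume group (VG column "None") and are therefore candidates for a new storage pool. A small filter for that, sketched against the output format pasted above:

```shell
# Print physical volumes whose VG column is "None" (i.e. still
# unassigned), using lspv output in the format shown above.
lspv_output='NAME PVID VG STATUS
hdisk0 0008201a64fa66f6 None
hdisk1 0008201af8f5b71e rootvg active
hdisk2 0008201a886cd4bd lparvg active'
printf '%s\n' "$lspv_output" | awk 'NR > 1 && $3 == "None" {print $1}'
```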
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/751 with validation runs for every worker instance:
https://openqa.suse.de/tests/13846092
https://openqa.suse.de/tests/13846093
https://openqa.suse.de/tests/13846094
https://openqa.suse.de/tests/13846095
https://openqa.suse.de/tests/13846096
https://openqa.suse.de/tests/13846097
https://openqa.suse.de/tests/13846098
https://openqa.suse.de/tests/13846099
hmc_ppc64le-4disk is still disabled because currently only a single disk is configured per LPAR
Updated by okurz about 1 month ago · Edited
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/751 merged. Please monitor the effect on production.
Wrote in https://suse.slack.com/archives/C02CANHLANP/p1711312168973729
@here Finally with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/751 we have PowerVM using the "hmc" backend back in production on OSD. The machine redcurrant will pick up jobs that had been scheduled on production. Please bring back the schedule for PowerVM tests using the "pvm_hmc" backend and report issues like control workers unable to establish a contact to the test instances. See https://progress.opensuse.org/issues/139199 for details
Updated by nicksinger about 1 month ago
- Status changed from Feedback to Resolved
I had to follow-up with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/758 but with that merged I can see production-jobs passing and reaching stages after PXE. That now covers AC1.
Updated by okurz 25 days ago
- Related to action #153718: Move of LSG QE non-openQA PowerPC machine NUE1 to PRG2 - haldir size:M added
Updated by okurz 10 days ago
- Copied to action #159231: Bring back worker class "hmc_ppc64le-4disk" on redcurrant or another machine size:M added