action #139199


coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M

Added by okurz 6 months ago. Updated 25 days ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Most PowerPC machines are being set up in PRG2 within #132140 and are at least discoverable from the HMC. Now we can set up redcurrant as a production openQA PowerVM worker in OSD again.

Acceptance criteria

Suggestions


Related issues 5 (4 open, 1 closed)

Related to QA - action #132140: Support move of PowerPC machines to PRG2 size:M (Blocked, okurz, 2023-06-29)

Related to QA - action #153718: Move of LSG QE non-openQA PowerPC machine NUE1 to PRG2 - haldir size:M (In Progress, nicksinger, 2024-01-16 – 2024-05-01)

Copied from QA - action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 (Resolved, nicksinger, 2023-06-29)

Copied to QA - action #157777: Provide more consistent PowerPC openQA ressources by migrating all novalink instances to hmc size:M (Blocked, okurz)

Copied to QA - action #159231: Bring back worker class "hmc_ppc64le-4disk" on redcurrant or another machine size:M (Workable)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 added
Actions #2

Updated by okurz 6 months ago

  • Related to action #132140: Support move of PowerPC machines to PRG2 size:M added
Actions #3

Updated by okurz 5 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz 5 months ago

  • Target version changed from Tools - Next to Ready
Actions #5

Updated by okurz 5 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from Ready to future

Blocking on #139115 as we first need to ensure we can work with PowerPC in PRG2.

Actions #6

Updated by okurz 4 months ago

  • Priority changed from Normal to Low

Blocked on #139115 which is Low and cannot be higher due to second-level dependencies on SUSE-IT. Hence, without further escalation, this ticket can also not be higher than "Low" priority.

Actions #8

Updated by nicksinger 2 months ago

Checking if something changed in the current setup. I tried to start an LPAR called "redcurrant-1" and monitored its boot process by connecting to the vterm of the machine via the HMC:

$ ssh nsinger@10.145.14.33                 # log in to the HMC
$ mkvterm -m redcurrant -p redcurrant-1    # open a virtual terminal for the LPAR "redcurrant-1"

-> The system complained about "No OS found". Now checking whether the VIOS comes up properly by replacing redcurrant-1 with vios. We have 2 other VIOS installed, so maybe just the wrong one is booted (the partitions and their states can be listed from the HMC, see the sketch below the boot banner). Currently it hangs at:

-------------------------------------------------------------------------------                                                            
                       Welcome to the Virtual I/O Server.                                                                                  
                   boot image timestamp: 12:22:52 11/18/2019                                                                               
                 The current time and date: 13:26:26 02/14/2024                                                                            
        processor count: 1;  memory size: 5120MB;  kernel size: 45241275                                                                   
     boot device: /pci@800000020000015/pci1014,034A@0/sas/disk@160763df700                                                                 
-------------------------------------------------------------------------------                                                            
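
A minimal sketch of how the partitions and their states could be listed from the HMC CLI to see which VIOS exists and is running; the managed system name "redcurrant" is taken from the mkvterm call above, the field list is an assumption:

$ ssh nsinger@10.145.14.33
# list all LPARs of the managed system with state and type
# (lpar_env distinguishes vioserver partitions from aixlinux partitions)
$ lssyscfg -r lpar -m redcurrant -F name,state,lpar_env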
Actions #9

Updated by okurz 2 months ago

  • Status changed from Blocked to In Progress
Actions #10

Updated by okurz 2 months ago · Edited

  • Priority changed from Low to High
  • Target version changed from future to Ready

Created test verification jobs:

# clone the reference job once for each of the eight redcurrant worker instances
for i in {1..8}; do openqa-clone-job --within-instance https://openqa.suse.de/tests/13492073 WORKER_CLASS=redcurrant-$i TEST+=-redcurrant-$i-poo139199 _GROUP=0 BUILD= ;done
Actions #11

Updated by okurz 2 months ago

https://suse.slack.com/archives/C02CANHLANP/p1707984316167889?thread_ts=1707981321.054639&cid=C02CANHLANP

(Oliver Kurz) https://openqa.suse.de/tests/13524999 and https://openqa.suse.de/tests/13525006 look promising. It seems we have network but no BOOTP. So the next step needed is likely to add the corresponding options to the host config in https://gitlab.suse.de/OPS-Service/salt/-/blob/production/pillar/domain/oqa_prg2_suse_org/hosts.yaml?ref_type=heads#L297 and following to boot grub over the network. @Gerhard Schlotter @Ignacio Torres does suttner1/2 already feature network PXE support?

Actions #13

Updated by okurz 2 months ago

  • Status changed from In Progress to Blocked

Created #155521, blocking on that

Actions #14

Updated by okurz 2 months ago

  • Priority changed from High to Normal

Still blocked by #155521 which is waiting for an MR to be merged

Actions #15

Updated by okurz about 2 months ago

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to nicksinger
  • Priority changed from Normal to High

#155521 resolved. Now the VIOS for redcurrant needs to be fixed as mentioned in #139199-8, #155521-24 and https://jira.suse.com/browse/ENGINFRA-3751

Actions #16

Updated by openqa_review about 2 months ago

  • Due date set to 2024-03-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz about 2 months ago

As the VIOS installation still fails, the next steps are:

  1. Ask in #discuss-powerpc-architecture who can help to understand why the installation fails to access the network in #155521-18
  2. Check again whether we actually see DHCP (and specifically BOOTP) requests from the redcurrant VIOS on our oqa.prg2.suse.org DHCP servers suttner1+2 (see the sketch after this list)
  3. Ask for firewall access to crosscheck whether the requirements from https://www.ibm.com/support/pages/ports-and-protocols-be-allowed-firewall-installios are fulfilled or whether related traffic is blocked; if IT denies that access, ask them to fix the VIOS installation on redcurrant themselves and give them all credentials they need
  4. Because so far people only seem to get a VIOS installation done when the HMC is in the same network as the system ethernet, ask IT to connect the redcurrant system ethernet to the same network as the HMC, conduct the installation there, and reconfigure the system ethernet afterwards if successful
  5. Ask IBM partner contacts for help
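
A minimal sketch for point 2, assuming shell access to the DHCP servers; the FQDN and the need for sudo are assumptions:

$ ssh suttner1.oqa.prg2.suse.org
# watch for incoming DHCP/BOOTP requests (both use UDP ports 67/68)
$ sudo tcpdump -ni any 'udp port 67 or udp port 68'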
Actions #18

Updated by nicksinger about 2 months ago

While collecting data for 1) and trying out some more things which I learned from e.g. https://progress.opensuse.org/issues/139112, I realized that redcurrant and blackcurrant constantly reconnect to the HMC. This renders the machines almost unusable. I tried to reset the management connection, re-add the systems, reboot the physical systems and reboot the HMC, but nothing helped. Reached out to #dct-migration asking if they can check whether the switch ports are flapping.

Actions #19

Updated by nicksinger about 1 month ago

I managed to get redcurrant into a working state again by first removing all HMC connections in the ASM, noting down one of the ASM IPs (10.255.255.190 in this case) and removing the machine from the HMC web interface itself.
After manually re-adding only that single IP, the system connected back to the HMC and didn't "flicker" any longer. After that I could conduct another try of installios which I tracked here: https://paste.opensuse.org/pastes/367d26f0b2b2. I used that to ask in #discuss-powerpc-architecture for further ideas on how I could debug: https://suse.slack.com/archives/C04K6388YUX/p1710765088843389

I then quickly crosschecked my installation method by letting Gerhard move all interfaces of redcurrant temporarily into the HMC network. After adjusting the IPs slightly, the HMC was able to conduct a successful VIOS installation on redcurrant. Gerhard then moved the interfaces back into the oqa.prg2 network. Currently LPARs on grenache are not able to reach the network, so we might now need to reconfigure the internal networking of redcurrant.

I also asked Lazaros to crosscheck our requirements from https://www.ibm.com/support/pages/ports-and-protocols-be-allowed-firewall-installios in #dct-migration: https://suse.slack.com/archives/C04MDKHQE20/p1710774211903239

Actions #20

Updated by nicksinger about 1 month ago

LPARs are currently still without network. I struggle to replicate the same setup as on grenache because grenache seems to be "co-managed" only, which reduces the options in the HMC.
The LPARs are already connected to the "VLAN2-ETHERNET0" virtual network but fail to reach anything with it. I assume we are missing the actual connection between the real world and this virtual network. Trying to bridge it didn't work because the HMC complains about an interface being in "DHCP Mode" (I assume it means the one in the VIOS). I will try a static configuration and see if I can configure it this way (the current virtual network setup can also be inspected from the HMC CLI, see the sketch below).
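
A minimal sketch of how the existing virtual switches and virtual ethernet adapters could be inspected from the HMC CLI; the managed system name is taken from above, the exact output fields depend on the HMC version:

# list the virtual switches of the managed system
$ lshwres -m redcurrant -r virtualio --rsubtype vswitch
# list the virtual ethernet adapters of all LPARs (VLAN IDs, bridging state, MAC addresses)
$ lshwres -m redcurrant -r virtualio --rsubtype eth --level lpar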

Actions #21

Updated by okurz about 1 month ago

  • Subject changed from Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 to Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M
Actions #22

Updated by nicksinger about 1 month ago

VIOS DHCP was disabled by logging onto the VIOS via mkvterm on the HMC and changing to the root shell with oem_setup_env. First, disable dhcpcd with stopsrc -s dhcpcd, then de-configure all interfaces visible in ifconfig -a (except "lo") with ifconfig en3 down and ifconfig en3 detach. After each interface is down, verify and stop the TCP/IP stack with rmtcpip. Then reconfigure the last (physical) interface of redcurrant (-T4 device suffix, en3/ent3 in the VIOS) with the following mktcpip call (a consolidated sketch of the whole sequence follows the usage output below):

# mktcpip -h redcurrant-vios -a 10.145.10.237 -i en3 -n 10.144.53.53 -d oqa.prg2.suse.org -m 255.255.255.0 -g 10.145.10.254
# mktcpip
    Usage:  mktcpip {-S interface | -h hostname -a address -i interface
                [-n nameserver_address -d domain] [-m subnet_mask]
                [-g gateway_address]] [-t cable_type] [-r ring_speed]
                [-c subchannel -D destination_address] [-s]}

    -h hostname            Hostname of your machine
    -a address             IP address of the interface specified by -i
                           (must be in dotted decimal notation)
    -i interface           Interface to associate with the -a IP address
    -n nameserver_address  IP address of nameserver machine
                           (must be in dotted decimal notation)
    -d domain              Domain name, only use with -n
    -m subnet_mask         Subnetwork mask (dotted decimal or 0x notation)
    -g gateway_address     Gateway destination address
                           (dotted decimal or symbolic name)
    -t cable_type          Cable type for ethernet, either 'bnc', 'dix', 'tp'
                           or 'N/A'
    -r ring_speed          Ring speed for token ring adapter, either 4 or 16
    -s                     Start the TCP/IP daemon
    -c subchannel          Subchannel Address for 370 Channel Adapter
    -D destination_address Destination IP address for 370 Channel Adapter
    -S interface           Retrieve information for SMIT display

    Example: /usr/sbin/mktcpip -h fred.austin.ibm.com -a 192.9.200.9 -i tr0
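
A consolidated sketch of the sequence described above, collected in order; the VIOS partition name in the mkvterm call is an assumption and the addresses are the ones from the mktcpip call above:

$ mkvterm -m redcurrant -p redcurrant-vios    # open the VIOS console from the HMC (partition name assumed)
$ oem_setup_env                               # leave the restricted padmin shell for a root shell
$ stopsrc -s dhcpcd                           # stop the DHCP client daemon
$ ifconfig -a                                 # list interfaces; then for every enX except lo0:
$ ifconfig en3 down; ifconfig en3 detach
$ rmtcpip                                     # verify and stop the TCP/IP stack
$ mktcpip -h redcurrant-vios -a 10.145.10.237 -i en3 -n 10.144.53.53 -d oqa.prg2.suse.org -m 255.255.255.0 -g 10.145.10.254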

An RMC connection to the HMC is critical to follow up with the LPAR network configuration, so ensure it works (how to reset it without a reboot: https://www.ibm.com/support/pages/how-resetrestart-rmc-server-vios; see also the sketch below).

Back in the HMC, in the view of the managed machine (redcurrant), all essential configuration can be found under "PowerVM -> Virtual Networks". I created a new "Virtual Network" and followed the wizard, which created a new "Virtual Switch" accordingly. This network was then bridged to a physical adapter port (-T3 suffix, second port on the 4-port-adapter of grenache).

Every LPAR then needs to be connected to the newly created virtual network. Selecting the LPAR view in the HMC, one can add a new connection under "Virtual I/O -> Virtual Networks -> Attach Virtual Network". A corresponding LPAR adapter is created which can be viewed on the same page as the networks by clicking the little gray arrow at the top. Selecting the new adapter under "Action -> View Virtual Ethernet Adapter Settings", the new MAC of that interface can be seen. As it is bridged with the physical adapter, this will be visible in our infrastructure. I created this for all LPARs and adjusted our config accordingly: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4938
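
A minimal sketch of the RMC reset without a reboot along the lines of the IBM page linked above, assuming a root shell on the VIOS (via oem_setup_env) and the standard RSCT path:

$ oem_setup_env                        # root shell on the VIOS
# /usr/sbin/rsct/bin/rmcctrl -z        # stop the RMC subsystem
# /usr/sbin/rsct/bin/rmcctrl -A        # add and start the RMC subsystem again
# /usr/sbin/rsct/bin/rmcctrl -p        # re-enable remote client connections (e.g. from the HMC)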

Things to still check:

  • will the VIOS use DHCP again after a reboot? How to check? (one possible check is sketched below)
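
One possible check after a reboot, assuming a root shell on the VIOS reached via oem_setup_env:

# the dhcpcd subsystem should stay inoperative if DHCP remains disabled
$ lssrc -s dhcpcd
# crosscheck that en3 still carries the static address configured with mktcpip
$ ifconfig en3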
Actions #23

Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Feedback

Next step after the MR is merged: Test if the LPARs can network boot and enable them again in openQA

Actions #24

Updated by okurz about 1 month ago

  • Status changed from Feedback to In Progress
Actions #25

Updated by okurz about 1 month ago

  • Copied to action #157777: Provide more consistent PowerPC openQA ressources by migrating all novalink instances to hmc size:M added
Actions #26

Updated by nicksinger about 1 month ago

We have a first openQA validation run: https://openqa.suse.de/tests/13845946
It shows that we also lost the disk configuration because of the VIOS re-installation. I used the following commands to create a new "Storage Pool" as the HMC had problems creating one because the physical disks contained fragments of the old installation/assignment:

ssh padmin@redcurrant-vios4.oqa.prg2.suse.org
$ lspv                           # list physical volumes and their volume group assignment
NAME             PVID                                 VG               STATUS
hdisk0           0008201a64fa66f6                     None
hdisk1           0008201af8f5b71e                     rootvg           active
hdisk2           0008201a886cd4bd                     lparvg           active
$ mksp -f lparvg hdisk2          # create the storage pool "lparvg" on hdisk2 (-f forces it despite leftovers of the old assignment)

Afterwards, in the machine view -> Virtual Storage -> Select redcurrant-vios -> Action -> Manage Virtual Storage, one can create Virtual Disks. Disks created here cannot be assigned to LPARs directly, but apparently one (unassigned) disk must exist for the LPAR to see the Storage Pool, so I created a "redcurrant-1" disk here to assign later. I opened the LPAR view for redcurrant-2 and up and followed "Virtual I/O -> Virtual Storage -> Logical Volume -> Add Logical Volume" to get a wizard which allows creating more Virtual Disks in the existing pool for the LPARs (a CLI-based alternative is sketched below).
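
A minimal sketch of how the same could presumably be done from the VIOS restricted shell instead of the HMC wizard; the disk size is a placeholder and vhost1 is a hypothetical virtual SCSI server adapter name:

$ ssh padmin@redcurrant-vios4.oqa.prg2.suse.org
$ lssp                                         # list storage pools and their free space
# create a backing device ("virtual disk") for an LPAR in the lparvg pool
$ mkbdsp -sp lparvg 100G -bd redcurrant-2
# map an existing backing device to a virtual SCSI server adapter
$ mkbdsp -sp lparvg -bd redcurrant-2 -vadapter vhost1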

Actions #28

Updated by okurz about 1 month ago · Edited

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/751 merged. Please monitor the effect on production.

Wrote in https://suse.slack.com/archives/C02CANHLANP/p1711312168973729

@here Finally with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/751 we have PowerVM using the "hmc" backend back in production on OSD. The machine redcurrant will pick up jobs that had been scheduled on production. Please bring back the schedule for PowerVM tests using the "pvm_hmc" backend and report issues like control workers unable to establish a contact to the test instances. See https://progress.opensuse.org/issues/139199 for details

Actions #29

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Resolved

I had to follow up with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/758 but with that merged I can see production jobs passing and reaching stages after PXE. That now covers AC1.

Actions #30

Updated by okurz 25 days ago

  • Related to action #153718: Move of LSG QE non-openQA PowerPC machine NUE1 to PRG2 - haldir size:M added
Actions #31

Updated by okurz 25 days ago

  • Due date deleted (2024-03-27)
Actions #32

Updated by okurz 10 days ago

  • Copied to action #159231: Bring back worker class "hmc_ppc64le-4disk" on redcurrant or another machine size:M added