action #139112
closed coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Ensure OSD openQA PowerPC machine grenache is operational from PRG2
Most PowerPC machines are being set up in PRG2 within #132140, and most of them could be discovered from the HMC, but apparently not grenache.
Acceptance criteria
- AC1: grenache openQA instances as referenced in are able to pass openQA jobs after the move to PRG2
- Read #132140 about the generic setup and in particular the HMC
- See current infrastructure management network entry about the machine "grenache"
- Ensure we have access to grenache manually as well as with verification openQA jobs, both for osd
- Update the infrastructure management network entry accordingly, e.g. new FQDN
- Inform users about the result
Rollback steps
- Add back salt key with
salt-key -y -a
Updated by okurz over 1 year ago
- Copied to action #139115: Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M added
Updated by okurz about 1 year ago
- Status changed from New to Blocked
- Assignee set to okurz
Updated by okurz 12 months ago · Edited
- Status changed from Blocked to In Progress
- Assignee changed from okurz to nicksinger
- Priority changed from Low to High
- Target version changed from future to Ready
grenache should be all connected properly but does not show up in
Suggestions how to continue:
- Research how DHCP on HMC works
- Try to find logs or debug how DHCP works either over webUI (operator display) or ssh as unprivileged user or privileged user
- Consider a) upstream IBM documentation b) colleagues at SUSE c) bug to establish communication to IBM d) AI
- Try to find out if grenache MACs show up in DHCP requests or logs on
- Maybe grenache isn't even supposed to show up because it should only run novalink. Crosscheck
If all that is exhausted without results then I assume we need to wait for " times are both out of sync with NTP" first.
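The suggestion to check whether grenache MACs show up in DHCP requests or logs could be approached with a small log filter. This is only a minimal sketch: the MAC addresses and log lines below are made up for illustration, and the line format is assumed to resemble ISC dhcpd syslog output, which may differ from what the HMC actually writes.

```python
import re

# Matches colon-separated MAC addresses such as 02:00:00:00:00:01 (any case).
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def find_mac_hits(log_lines, wanted_macs):
    """Return {mac: [lines]} for every wanted MAC that appears in the log."""
    wanted = {mac.lower() for mac in wanted_macs}
    hits = {}
    for line in log_lines:
        for mac in MAC_RE.findall(line):
            mac = mac.lower()
            if mac in wanted:
                hits.setdefault(mac, []).append(line.rstrip("\n"))
    return hits


# Hypothetical sample data, not taken from the real HMC or from grenache.
sample_log = [
    "dhcpd: DHCPDISCOVER from 02:00:00:00:00:01 via eth0",
    "dhcpd: DHCPOFFER on 192.0.2.10 to 02:00:00:00:00:01 via eth0",
    "dhcpd: DHCPDISCOVER from 02:00:00:00:00:99 via eth0",
]
hits = find_mac_hits(sample_log, ["02:00:00:00:00:01", "02:00:00:00:00:02"])
for mac, lines in hits.items():
    print(mac, len(lines), "matching line(s)")
```

A MAC that is missing from the result never appeared in the inspected log, which would point towards cabling or VLAN problems rather than DHCP configuration.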
Updated by openqa_review 12 months ago
- Due date set to 2024-03-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 12 months ago
is still open now but times on both suttner1+2 are good and in sync and dhcpd also seems to be in sync. I can also now see that shows grenache and, as expected, a red lock mentioning that HMC is not controlling the machine as grenache uses novalink. But I can't ping any of or or
Updated by nicksinger 12 months ago
The Jira task provided some very helpful pictures of the operator's display. We should keep these for future installations.
The HMC for grenache already looked very good after the cabling was done. VIOS was able to start without any issues and with it the NovaLink LPAR. However, none of the LPARs had network. Gerhard from IT organized that we connect all 4 ports on the C11 network card in that system. After finding and resolving a typo in the VLAN configuration and restarting the systems, all mentioned LPARs had network. Still no RMC connection from either NovaLink or VIOS but this might not be needed after all for our use-case.
Currently ssh
works and can be used to start LPARs, e.g. grenache-1, with lpar power-on -i name=grenache-1
- this is also what openQA does to start and stop LPARs.
Also grenache-1 can be reached via ssh
My next steps are to validate PXE in the other LPARs (grenache-{2..8}) and bring the machine back into openQA.
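For preparing all eight slots, the power-on call quoted above could be generated per LPAR. The following is a rough sketch only: the host name is a placeholder because the real ssh target is not given here, and the remote command is copied verbatim from the comment above without checking it against the NovaLink CLI.

```python
import shlex

# Hypothetical placeholder; the real NovaLink ssh target is not stated here.
NOVALINK_HOST = "novalink.example.org"


def power_on_command(lpar_name):
    """Build the power-on command quoted in this ticket for one LPAR."""
    return f"lpar power-on -i name={lpar_name}"


def ssh_argv(host, remote_command):
    """Wrap a remote command into an argv list usable with subprocess.run."""
    return ["ssh", host, remote_command]


# grenache-1 .. grenache-8, matching the slots mentioned above.
lpars = [f"grenache-{n}" for n in range(1, 9)]
commands = [ssh_argv(NOVALINK_HOST, power_on_command(name)) for name in lpars]

# Only print the commands; running them is left to the operator.
for argv in commands:
    print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to run anywhere; passing each argv list to subprocess.run would perform the actual call.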
Updated by nicksinger 12 months ago
PXE unfortunately does not work. I can ping qa-jump but the bootp request times out after 5 tries. I have yet to find out what the culprit is.
On a positive note: I was able to establish a fully working RMC connection with the VIOS after logging into it via a vterm (mkvterm -m grenache -p vios1 on HMC). On the VIOS I discovered with ifconfig -a that the HMC-facing network port (en14) still had an old 10.162.x.x IP. After elevating privileges with oem_setup_env I could execute smitty to get an interactive configuration menu. In there I went to "Communications Applications and Services" -> "TCP/IP" -> "Use DHCP for TCPIP Configuration & Startup" -> select en14 from the list -> "Use DHCP starting".
Next I will try to understand why PXE is not working. First test already shows that qa-jump can provide the requested "ppc64le/grub2" binary. I tested this on my local workstation with:
workstation ~ » atftp 69
tftp> get ppc64le/grub2
workstation ~ » file grub2
grub2: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), statically linked, stripped
I will now try to understand if the LPAR cannot access the server or if we have some more misconfiguration anywhere.
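The manual check with file above can also be scripted when the download needs to be verified repeatedly. This is a minimal sketch that only inspects the first 20 bytes of the ELF header; the helper names are made up for illustration and the sample header is synthetic, not the real grub2 binary.

```python
import struct

EM_PPC = 20  # e_machine value for 32-bit PowerPC in the ELF specification


def describe_elf(header):
    """Return (bits, byte_order, e_machine) for an ELF header, else None."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    bits = {1: 32, 2: 64}.get(header[4])           # EI_CLASS
    byte_order = {1: "LSB", 2: "MSB"}.get(header[5])  # EI_DATA
    if bits is None or byte_order is None:
        return None
    fmt = ">H" if byte_order == "MSB" else "<H"
    (machine,) = struct.unpack_from(fmt, header, 18)  # e_machine field
    return bits, byte_order, machine


def looks_like_ppc32_msb(header):
    """True if the header describes a 32-bit big-endian PowerPC ELF file."""
    return describe_elf(header) == (32, "MSB", EM_PPC)


# Synthetic header: magic, 32-bit class, MSB data, version 1, padding,
# then e_type=2 (executable) and e_machine=20 (PowerPC), both big-endian.
fake_header = b"\x7fELF" + bytes([1, 2, 1]) + bytes(9) + struct.pack(">HH", 2, EM_PPC)
print(looks_like_ppc32_msb(fake_header))
```

For a real check, the first 20 bytes of the downloaded grub2 file would be read in binary mode and passed to looks_like_ppc32_msb.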
Updated by okurz 12 months ago
Discussed with nicksinger in daily infra call. With we enabled pre-production worker classes to be used with openQA again.
Updated by nicksinger 12 months ago
Yesterday we found out that the PXE/bootp process is different than expected. The "Server" one has to enter in the SMS menu is not the TFTP server IP but rather the IP of the server handing out DHCP leases. With that I managed to get a successful openQA run: - however, TIMEOUT_SCALE=4 was needed because loading kernel+initrd takes quite some time.
Updated by nicksinger 12 months ago
- Status changed from In Progress to Feedback
I prepared every LPAR by starting it once, entering as "Server IP" in the bootup configuration and ran validation jobs against the LPAR slots:
This should provide us confidence to merge which also adds a TIMEOUT_SCALE=4 to every grenache slot as workaround until is resolved.
Updated by okurz 12 months ago
- Due date deleted (2024-03-22)
- Status changed from Feedback to Resolved
merged and deployed. racktables updated. Rollback steps concluded. grenache-1 is back in salt and back in production. looking for grenache shows how jobs already execute on the spvm backend. Will announce in #eng-testing
Updated by jbaier_cz 12 months ago
- Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Updated by okurz 11 months ago
- Related to action #158041: grenache needs upgrade to 15.5 added