action #139112
coordination #121720 (closed): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Ensure OSD openQA PowerPC machine grenache is operational from PRG2
0%
Description
Motivation
Most PowerPC machines are being set up in PRG2 within #132140, and most of them could be discovered from the HMC, but apparently not grenache.
Acceptance criteria
- AC1: grenache openQA instances as referenced in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls are able to pass openQA jobs after the move to PRG2
Suggestions
- Read #132140 about the generic setup and in particular the HMC
- See current infrastructure management network entry https://racktables.nue.suse.com/index.php?page=object&object_id=3120 about the machine "grenache"
- Ensure we have access to grenache manually as well as with verification openQA jobs, both for osd
- Update the infrastructure management network entry https://racktables.nue.suse.com/index.php?page=object&object_id=3120 accordingly, e.g. new FQDN
- Inform users about the result
Rollback steps
- Add back salt key with
salt-key -y -a grenache-1.qa.suse.de
Updated by okurz about 1 year ago
- Copied to action #139115: Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M added
Updated by okurz 10 months ago · Edited
- Status changed from Blocked to In Progress
- Assignee changed from okurz to nicksinger
- Priority changed from Low to High
- Target version changed from future to Ready
grenache should be fully connected but does not show up in powerhmc1.oqa.prg2.suse.org. Suggestions on how to continue:
- Research how DHCP on HMC works
- Try to find logs or debug how DHCP works either over webUI (operator display) or ssh as unprivileged user or privileged user
- Consider a) upstream IBM documentation b) colleagues at SUSE c) bugzilla.suse.com bug to establish communication to IBM d) AI
- Try to find out if grenache MACs show up in DHCP requests or logs on powerhmc1.oqa.prg2.suse.org
- Maybe grenache isn't even supposed to show up because it should only run novalink. Crosscheck https://jira.suse.com/browse/ENGINFRA-3750?focusedId=1333119&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1333119
If all that is exhausted without results then I assume we need to wait for https://jira.suse.com/browse/ENGINFRA-4030 "suttner1.oqa.prg2.suse.org+suttner2.oqa.prg2.suse.org times are both out of sync with NTP" first.
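For the suggestion to look for grenache MACs in DHCP requests or logs, a minimal sketch of a filter could look like the following. It only reads log lines from stdin; where the HMC keeps its DHCP logs and what they look like is not known here, and the MAC address in the usage comment is a made-up placeholder, not a real grenache MAC.

```shell
# Print only those log lines (read from stdin) that mention one of the
# given MAC addresses. Matching is case-insensitive and literal (-F),
# so the colons and hex digits are not treated as a regex.
filter_macs() {
    # One pattern per line, as grep expects for a pattern list.
    pattern=$(printf '%s\n' "$@")
    grep -i -F -e "$pattern"
}

# Usage (log path and MAC are hypothetical placeholders):
#   filter_macs aa:bb:cc:00:11:22 < /path/to/dhcp.log
```

If the filter prints nothing, that would suggest the requests never reach the DHCP server, which points towards cabling or VLAN configuration rather than the HMC itself.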
Updated by openqa_review 10 months ago
- Due date set to 2024-03-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago
https://jira.suse.com/browse/ENGINFRA-4030 is still open, but the times on both suttner1+2 are good and in sync, and dhcpd also seems to be in sync. I can now also see that https://powerhmc1.oqa.prg2.suse.org/ shows grenache and, as expected, a red lock mentioning that the HMC is not controlling the machine because grenache uses novalink. But I can't ping any of grenache-vios.oqa.prg2.suse.org, grenache-fsp1.oqa.prg2.suse.org or grenache-1.oqa.prg2.suse.org.
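The ping check for the three hostnames can be repeated with a small helper, sketched here under some assumptions: the `-W 2` timeout flag is the Linux iputils variant of ping, and the `PING_CMD` override is a name introduced only for this example to allow a dry run.

```shell
# Hostnames taken from the comment above.
hosts="grenache-vios.oqa.prg2.suse.org grenache-fsp1.oqa.prg2.suse.org grenache-1.oqa.prg2.suse.org"

# Print "<host> reachable" or "<host> unreachable" for each argument.
# PING_CMD (hypothetical override) replaces the real ping, e.g. for testing.
check_hosts() {
    for host in "$@"; do
        # Left unquoted on purpose so the command and its flags are split.
        if ${PING_CMD:-ping -c 1 -W 2} "$host" >/dev/null 2>&1; then
            echo "$host reachable"
        else
            echo "$host unreachable"
        fi
    done
}

# Usage:
#   check_hosts $hosts
```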
Updated by nicksinger 10 months ago
The Jira task https://jira.suse.com/browse/ENGINFRA-3750 provided some very helpful pictures of the operator display. We should keep these for future installations.
The HMC for grenache already looked very good after the cabling was done. VIOS was able to start without any issues and with it the NovaLink LPAR. However, none of the LPARs had network. Gerhard from IT organized that we connect all 4 ports on the C11 network card in that system. After finding and resolving a typo in the VLAN configuration and restarting the systems, all mentioned LPARs had network. Still no RMC connection from either NovaLink or VIOS but this might not be needed after all for our use-case.
Currently ssh padmin@grenache.oqa.prg2.suse.org works and can be used to start LPARs such as grenache-1 with lpar power-on -i name=grenache-1; this is also what openQA does to start and stop LPARs.
grenache-1 can also be reached via ssh grenache-1.oqa.prg2.suse.org.
My next steps are to validate PXE in the other LPARs (grenache-{2..8}) and bring the machine back into openQA.
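To start the remaining LPARs for PXE validation, the power-on command can be generated per slot. This is a dry-run sketch that only prints the commands. It assumes that the padmin login accepts the command as an ssh argument, which is not confirmed in this ticket, and that the LPARs are named grenache-2 to grenache-8 as written above.

```shell
# Print (do not run) the power-on command for each LPAR number given.
lpar_power_on_cmds() {
    for n in "$@"; do
        echo "ssh padmin@grenache.oqa.prg2.suse.org lpar power-on -i name=grenache-$n"
    done
}

# One line per remaining LPAR slot.
lpar_power_on_cmds 2 3 4 5 6 7 8
```

Printing instead of executing keeps the sketch safe to run anywhere; the lines can be reviewed and pasted one by one.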
Updated by nicksinger 10 months ago
PXE unfortunately does not work. I can ping qa-jump, but the bootp request times out after 5 tries. I have yet to find out what the culprit is.
On a positive note: I was able to establish a fully working RMC connection with the VIOS after logging into it via a vterm (mkvterm -m grenache -p vios1 on the HMC). On the VIOS, ifconfig -a showed that the HMC-facing network port (en14) still had an old 10.162.x.x IP. After elevating privileges with oem_setup_env I could run smitty to get an interactive configuration menu. In there I went to "Communications Applications and Services" -> "TCP/IP" -> "Use DHCP for TCPIP Configuration & Startup" -> select en14 from the list -> "Use DHCP starting".
Next I will try to understand why PXE is not working. First test already shows that qa-jump can provide the requested "ppc64le/grub2" binary. I tested this on my local workstation with:
workstation ~ » atftp 10.168.192.10 69
tftp> get ppc64le/grub2
tftp>
workstation ~ » file grub2
grub2: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), statically linked, stripped
I will now try to understand if the LPAR cannot access the server or if we have some more misconfiguration anywhere.
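The manual TFTP test can be sketched non-interactively as below. It uses the long options of atftp (--get, --remote-file, --local-file) and checks the downloaded file with file -b. The function names are invented for this example, and the exit status of atftp is deliberately not relied on; the file type check decides the result.

```shell
# Succeed if a `file` description looks like a PowerPC ELF binary.
is_ppc_elf() {
    case "$1" in
        *ELF*PowerPC*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fetch a remote file over TFTP into a temporary file and check its type.
# Requires atftp and file to be installed; nothing is fetched on definition.
fetch_and_check() {
    server=$1
    remote=$2
    out=$(mktemp) || return 1
    atftp --get --remote-file "$remote" --local-file "$out" "$server" 69
    desc=$(file -b "$out")
    rm -f "$out"
    is_ppc_elf "$desc"
}

# Usage, with the server IP and path from the session above:
#   fetch_and_check 10.168.192.10 ppc64le/grub2 && echo "grub2 looks fine"
```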
Updated by okurz 10 months ago
Discussed with nicksinger in daily infra call. With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/740 we enabled pre-production worker classes to be used with openQA again.
Updated by nicksinger 10 months ago
Yesterday we found out that the PXE/bootp process is different than expected. The "Server" one has to enter in the SMS menu is not the TFTP server IP but rather the IP of the server handing out DHCP leases. With that I managed to get a successful openQA run: https://openqa.suse.de/tests/13762881. However, TIMEOUT_SCALE=4 was needed because loading kernel+initrd takes quite some time.
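As a rough illustration of what TIMEOUT_SCALE=4 means, assuming the setting simply multiplies test timeouts by the given factor: the 90 second base value below is a made-up example, not a timeout taken from any grenache job.

```shell
# Multiply a base timeout in seconds by a scale factor (default 1).
scaled_timeout() {
    echo $(( $1 * ${2:-1} ))
}

# Hypothetical 90 s base timeout with TIMEOUT_SCALE=4.
scaled_timeout 90 4
```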
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
I prepared every LPAR by starting it once, entering 255.255.255.255 as "Server IP" in the bootup configuration and ran validation jobs against the LPAR slots:
- https://openqa.suse.de/tests/13764653
- https://openqa.suse.de/tests/13764655
- https://openqa.suse.de/tests/13764656
- https://openqa.suse.de/tests/13764657
- https://openqa.suse.de/tests/13764658
- https://openqa.suse.de/tests/13764660
- https://openqa.suse.de/tests/13764661
This should give us enough confidence to merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/742, which also adds TIMEOUT_SCALE=4 to every grenache slot as a workaround until https://jira.suse.com/browse/ENGINFRA-3941 is resolved.
Updated by okurz 10 months ago
- Due date deleted (2024-03-22)
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/742 merged and deployed. racktables updated. Rollback steps concluded. grenache-1 is back in salt and back in production. Searching https://openqa.suse.de/admin/workers for grenache shows that jobs already execute on the spvm backend. Will announce in #eng-testing.
Updated by jbaier_cz 9 months ago
- Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Updated by okurz 9 months ago
- Related to action #158041: grenache needs upgrade to 15.5 added