action #139112

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Ensure OSD openQA PowerPC machine grenache is operational from PRG2

Added by okurz about 1 year ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Most PowerPC machines are being set up in PRG2 within #132140, and most of them could be discovered from the HMC, but apparently not grenache.

Acceptance criteria

Suggestions

Rollback steps

  • Add back salt key with salt-key -y -a grenache-1.qa.suse.de

Related issues 5 (1 open, 4 closed)

  • Related to openQA Tests - action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? (New, 2024-03-18)
  • Related to openQA Infrastructure - action #158041: grenache needs upgrade to 15.5 (Resolved, okurz, 2024-03-26, 2024-04-09)
  • Copied from QA - action #132140: Support move of PowerPC machines to PRG2 size:M (Resolved, okurz, 2023-06-29)
  • Copied to QA - action #139115: Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M (Resolved, nicksinger, 2023-06-29)
  • Copied to QA - action #139199: Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M (Resolved, nicksinger, 2023-06-29)

Actions #1

Updated by okurz about 1 year ago

  • Copied from action #132140: Support move of PowerPC machines to PRG2 size:M added
Actions #2

Updated by okurz about 1 year ago

  • Copied to action #139115: Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M added
Actions #3

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz about 1 year ago

  • Copied to action #139199: Ensure OSD openQA PowerPC machine redcurrant is operational from PRG2 size:M added
Actions #5

Updated by okurz 10 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #6

Updated by okurz 10 months ago

  • Priority changed from Normal to Low
Actions #7

Updated by okurz 9 months ago · Edited

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to nicksinger
  • Priority changed from Low to High
  • Target version changed from future to Ready

grenache should be all connected properly but does not show up in powerhmc1.oqa.prg2.suse.org. Suggestions how to continue:

  1. Research how DHCP on HMC works
  2. Try to find logs or debug how DHCP works either over webUI (operator display) or ssh as unprivileged user or privileged user
  3. Consider a) upstream IBM documentation b) colleagues at SUSE c) bugzilla.suse.com bug to establish communication to IBM d) AI
  4. Try to find out if grenache MACs show up in DHCP requests or logs on powerhmc1.oqa.prg2.suse.org
  5. Maybe grenache isn't even supposed to show up because it should only run novalink. Crosscheck https://jira.suse.com/browse/ENGINFRA-3750?focusedId=1333119&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1333119

If all that is exhausted without results then I assume we need to wait for https://jira.suse.com/browse/ENGINFRA-4030 "suttner1.oqa.prg2.suse.org+suttner2.oqa.prg2.suse.org times are both out of sync with NTP" first.
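
For suggestion 4, a minimal sketch of how DHCP log lines could be filtered for known MAC addresses. The MAC addresses and the log line format below are placeholders (assumptions), since neither the real grenache MACs nor the HMC log format are given in this ticket.

```python
import re

# Hypothetical MAC addresses; the real grenache MACs are not listed in this ticket.
GRENACHE_MACS = {"aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"}

# Matches colon-separated MAC addresses in upper or lower case.
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def lines_mentioning(macs, log_lines):
    """Return the log lines that contain at least one of the given MACs."""
    wanted = {m.lower() for m in macs}
    hits = []
    for line in log_lines:
        found = {m.lower() for m in MAC_RE.findall(line)}
        if found & wanted:
            hits.append(line)
    return hits


# Illustrative log lines only; the format is an assumption.
sample_log = [
    "DHCPDISCOVER from AA:BB:CC:DD:EE:01 via eth0",
    "DHCPOFFER on 192.0.2.10 to 11:22:33:44:55:66 via eth0",
]
matches = lines_mentioning(GRENACHE_MACS, sample_log)
```

If no line matches, the requests either never reach the DHCP server or are logged in a different place or format.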

Actions #8

Updated by openqa_review 9 months ago

  • Due date set to 2024-03-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by okurz 9 months ago

https://jira.suse.com/browse/ENGINFRA-4030 is still open now but times on both suttner1+2 are good and in sync and dhcpd also seems to be in sync. I can also now see that https://powerhmc1.oqa.prg2.suse.org/ shows grenache and – as expected – a red lock mentioning that HMC is not controlling the machine as grenache uses novalink. But I can't ping any of grenache-vios.oqa.prg2.suse.org or grenache-fsp1.oqa.prg2.suse.org or grenache-1.oqa.prg2.suse.org

Actions #10

Updated by nicksinger 9 months ago

The Jira task https://jira.suse.com/browse/ENGINFRA-3750 provided some very helpful pictures of the operator display. We should keep these for future installations.
The HMC for grenache already looked very good after the cabling was done. VIOS was able to start without any issues and with it the NovaLink LPAR. However, none of the LPARs had network. Gerhard from IT organized that we connect all 4 ports on the C11 network card in that system. After finding and resolving a typo in the VLAN configuration and restarting the systems, all mentioned LPARs had network. Still no RMC connection from either NovaLink or VIOS but this might not be needed after all for our use-case.

Currently ssh padmin@grenache.oqa.prg2.suse.org works and can be used to start LPARs like e.g. grenache-1 with lpar power-on -i name=grenache-1 - this is also what openQA does to start and stop LPARs.
Also grenache-1 can be reached via ssh grenache-1.oqa.prg2.suse.org.

My next steps are to validate PXE in the other LPARs (grenache-{2..8}) and bring the machine back into openQA.
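
As a rough sketch for the remaining LPARs, the names grenache-2 to grenache-8 and the matching power-on commands could be generated like this. The command text is copied verbatim from the comment above; how it is actually invoked over the padmin ssh session is not shown here, and the helper names are made up for illustration.

```python
def lpar_names(first=2, last=8, prefix="grenache-"):
    """Expand the grenache-{2..8} range into a list of LPAR names."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


def power_on_command(lpar):
    """Build the power-on command string as quoted in this ticket."""
    return f"lpar power-on -i name={lpar}"


# One command per LPAR that still needs PXE validation.
commands = [power_on_command(name) for name in lpar_names()]
```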

Actions #11

Updated by nicksinger 9 months ago

PXE unfortunately does not work. I can ping qa-jump but the bootp request times out after 5 tries. I have yet to find out what the culprit is.

On a positive note: I was able to establish a fully working RMC connection with the VIOS after logging into it via a vterm (mkvterm -m grenache -p vios1 on HMC). On the VIOS I discovered with ifconfig -a that the HMC-facing network port (en14) still had an old 10.162.x.x IP. After elevating privileges with oem_setup_env I could execute smitty to get an interactive configuration menu. In there I went to "Communications Applications and Services" -> "TCP/IP" -> "Use DHCP for TCPIP Configuration & Startup" -> select en14 from the list -> "Use DHCP starting".

Next I will try to understand why PXE is not working. First test already shows that qa-jump can provide the requested "ppc64le/grub2" binary. I tested this on my local workstation with:

workstation ~ » atftp 10.168.192.10 69
tftp> get ppc64le/grub2
tftp>

workstation ~ » file grub2
grub2: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), statically linked, stripped

I will now try to understand if the LPAR cannot access the server or if we have some more misconfiguration anywhere.
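
The manual atftp check above could also be scripted. The following is only a sketch of a TFTP read as described in RFC 1350, without retransmission or option negotiation, so it is suited for a quick reachability test and nothing more. The function names are made up for illustration; the ELF check mirrors what `file` reported (32-bit, MSB).

```python
import socket
import struct


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request (opcode 1): opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def looks_like_elf32_msb(data):
    """Check the ELF magic plus the 32-bit class and big-endian data bytes."""
    return len(data) >= 6 and data[:4] == b"\x7fELF" and data[4] == 1 and data[5] == 2


def tftp_get(host, filename, port=69, timeout=5.0):
    """Fetch a file via TFTP; raises on timeout or on a TFTP error packet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_rrq(filename), (host, port))
        data = b""
        expected = 1
        while True:
            packet, peer = sock.recvfrom(4 + 512)
            if len(packet) < 4:
                continue
            opcode, block = struct.unpack("!HH", packet[:4])
            if opcode == 5:  # ERROR packet: error code, message, NUL
                raise OSError("TFTP error: " + packet[4:-1].decode("ascii", "replace"))
            if opcode != 3 or block != expected:
                continue  # ignore anything but the next DATA block
            data += packet[4:]
            # Acknowledge the block to the server's transfer port.
            sock.sendto(struct.pack("!HH", 4, block), peer)
            if len(packet) - 4 < 512:  # a short block ends the transfer
                return data
            expected = (expected + 1) & 0xFFFF
    finally:
        sock.close()


# Usage, with the address and path from the transcript above:
#   grub = tftp_get("10.168.192.10", "ppc64le/grub2")
#   print(looks_like_elf32_msb(grub))
```

Running this from a host in the same network as the LPARs, rather than from a workstation, would show whether the server is reachable from there.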

Actions #12

Updated by okurz 8 months ago

Discussed with nicksinger in daily infra call. With https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/740 we enabled pre-production worker classes to be used with openQA again.

Actions #13

Updated by nicksinger 8 months ago

Yesterday we found out that the PXE/bootp process is different than expected. The "Server" one has to enter in the SMS menu is not the TFTP server IP but rather the IP of the server handing out DHCP leases. With that I managed to get a successful openQA run: https://openqa.suse.de/tests/13762881 - however, TIMEOUT_SCALE=4 was needed because loading kernel+initrd takes quite some time.
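
To illustrate the effect of TIMEOUT_SCALE=4: assuming the setting acts as a plain multiplier on test timeouts (my reading of the openQA documentation, not verified here), a timeout grows as sketched below. The 300 s base value is a made-up example, not taken from any test.

```python
def scaled_timeout(base_seconds, timeout_scale=1):
    """Return the effective timeout, assuming a plain multiplication."""
    return base_seconds * timeout_scale


# Hypothetical base timeout of 300 s for loading kernel+initrd.
effective = scaled_timeout(300, timeout_scale=4)
```

With these example numbers a step that would time out after 5 minutes gets 20 minutes instead.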

Actions #14

Updated by nicksinger 8 months ago

  • Status changed from In Progress to Feedback

I prepared every LPAR by starting it once, entering 255.255.255.255 as "Server IP" in the bootup configuration and ran validation jobs against the LPAR slots:

This should provide us confidence to merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/742 which also adds a TIMEOUT_SCALE=4 to every grenache slot as workaround until https://jira.suse.com/browse/ENGINFRA-3941 is resolved.

Actions #15

Updated by okurz 8 months ago

  • Due date deleted (2024-03-22)
  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/742 merged and deployed. racktables updated. Rollback steps concluded. grenache-1 is back in salt and back in production. https://openqa.suse.de/admin/workers looking for grenache shows how jobs already execute on the spvm backend. Will announce in #eng-testing

Actions #16

Updated by jbaier_cz 8 months ago

  • Related to action #157447: ppc64le-spvm stops at grub page, timeout due to slow PXE traffic between PRG2 and NUE2? added
Actions #17

Updated by okurz 8 months ago
