action #119059

open

Use qa-power8 for ppc tests in o3 - network connected?

Added by okurz over 1 year ago. Updated 9 months ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

The machine "power8" is currently not available for o3, meaning there is currently no ppc testing in o3 at all. Use the existing machine qa-power8 for o3 ppc testing.

Acceptance criteria

  • AC0: We know what kind of setup we need (baremetal vs. HMC managed, e.g. ask test maintainers)
  • AC1: At least one ppc openQA test from openqa.opensuse.org passed after running on qa-power8 (reduced schedule is ok)

Suggestions

  • DONE: Find out BMC details -> covered with https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/43
  • Power on the machine over BMC https://qa-power8.qa.suse.de
  • Ensure there is a physical network connection: according to https://progress.opensuse.org/issues/119059 there is currently no ethernet connection for the guest system
  • Follow-up with suggestions from #119059#note-47
    1. Physically connect keyboard+display(or serial)+storage to the machine, install Linux and enable remote IPMI
    2. Connect back to HMC, install a VM, try to enable remote IPMI access to the bare-metal machine, then switch back to OPAL and continue with a bare-metal Linux installation (preferred by cdywan, sriedel, nicksinger)
    3. Follow up with the discussions with IBM in bugzilla to find out what is the right way to install remote controlled bare-metal in OPAL mode (preferred by mkittler, okurz)
    4. As discussed in #119059#note-43 as alternative: we could continue with the HMC mode. This would likely mean setting up an HMC within the o3 network (or making HMC 3 somehow accessible).
  • Try to login over ssh or virt-manager or reinstall in case there is no or no usable OS on it
  • Install openQA-worker on the machine
  • Create SUSE-IT EngInfra ticket to move the machine's ethernet interface based on racktables information to the o3 network (VLAN 662)
  • Add the corresponding networking information on o3 to /etc/dnsmasq.d/openqa.conf
  • Configure the machine OS for dhcp client mode and ensure the machine gets an address from o3 dnsmasq
  • Ensure the system is able to execute openQA tests from o3
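
The dnsmasq step in the list above could look roughly like the following sketch. Only the file name /etc/dnsmasq.d/openqa.conf comes from the description; the MAC address, IPv4 address and hostname are placeholders, and the helper name dnsmasq_host_entry is made up for illustration.

```shell
#!/bin/sh
# Print a dnsmasq static-lease line of the form dhcp-host=<mac>,<ip>,<hostname>.
# All three arguments are placeholders that must be replaced with real values.
dnsmasq_host_entry() {
    mac=$1 ip=$2 name=$3
    printf 'dhcp-host=%s,%s,%s\n' "$mac" "$ip" "$name"
}

# Example with hypothetical values (192.0.2.0/24 is a documentation range).
dnsmasq_host_entry "00:11:22:33:44:55" "192.0.2.10" "qa-power8"
```

The printed line would be appended to /etc/dnsmasq.d/openqa.conf on o3, followed by a restart of dnsmasq so that the machine's DHCP client receives the fixed address.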

Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure - action #122302: Support SD-105827 "PowerPC often fails to boot from network with 'error: time out opening'" (Resolved, nicksinger, 2022-12-21, 2023-02-02)
Copied to openQA Infrastructure - action #119755: Use a PowerVM machine to serve both PowerVM LPARs for testing as well as one VM running qemu tests (New, 2022-11-02)
Copied to openQA Infrastructure - action #123712: PowerVM HMC within o3, e.g. on VM (New, 2023-01-26)
Copied to openQA Infrastructure - action #125216: Use qa-power8 for ppc tests in o3 - try one of the suggestions size:M (Resolved, okurz, 2023-03-01)

Actions #3

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz over 1 year ago

  • Tags set to next-office-day

According to BMC the machine is on. According to racktables https://racktables.nue.suse.com/index.php?page=object&object_id=2350 there should be an IPv4 address 10.162.6.170 (ps64mm1070.qa.suse.de) but there is no response on ping. According to racktables there might not be any ethernet connection. The BMC menu "network access" mentions an IPv4 address 10.163.41.238 which does respond on ping and also serves ssh. I did not manage to login over ssh with any password known to me.

nmap -O for OS detection says:

$ sudo nmap -AO 10.163.41.238
Starting Nmap 7.92 ( https://nmap.org ) at 2022-10-19 12:11 CEST
Nmap scan report for 10.163.41.238
Host is up (0.000075s latency).
Not shown: 997 closed tcp ports (reset)
PORT   STATE SERVICE    VERSION
22/tcp open  ssh        OpenSSH 8.4 (protocol 2.0)
| ssh-hostkey: 
|   3072 8f:8f:9c:6d:ae:30:0d:e1:65:1a:c5:0f:2e:21:65:16 (RSA)
|   256 9c:d3:c1:7e:ce:e7:8c:93:0c:af:b6:24:60:95:59:88 (ECDSA)
|_  256 d0:31:38:c4:41:f7:78:14:da:be:a8:f3:b0:ea:ea:55 (ED25519)
25/tcp open  smtp       Postfix smtpd
|_smtp-commands: localhost, PIPELINING, SIZE, ETRN, ENHANCEDSTATUSCODES, 8BITMIME, DSN, SMTPUTF8, CHUNKING
53/tcp open  tcpwrapped
Device type: general purpose
Running: Linux 2.6.X
OS CPE: cpe:/o:linux:linux_kernel:2.6.32
OS details: Linux 2.6.32
Network Distance: 0 hops
Service Info: Host: localhost

OS and Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 21.56 seconds

nsinger said that the Linux kernel is likely 4.X

Actions #5

Updated by okurz over 1 year ago

  • Assignee deleted (okurz)

So next step could be to learn how to install something, e.g. PXE boot or so.

Actions #6

Updated by okurz over 1 year ago

OK, we learned that the reported IPv4 address 10.163.41.238 is actually my own, so the scan above does not help.
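
A small guard against scanning one's own workstation could look like the sketch below. It assumes the one-line-per-address output format of "ip -o -4 addr show" (interface in field 2, address/prefix in field 4); the helper name is_local_ipv4 is made up for illustration.

```shell
#!/bin/sh
# is_local_ipv4 ADDR: succeed if ADDR is assigned to an interface of this host.
# Reads lines in the format of `ip -o -4 addr show` from stdin, so the
# check can also be exercised offline with sample input.
is_local_ipv4() {
    awk -v want="$1" '
        { split($4, a, "/"); if (a[1] == want) found = 1 }
        END { exit !found }
    '
}

# Usage on a real system (not executed here):
#   ip -o -4 addr show | is_local_ipv4 10.163.41.238 && echo "this is my own address"
```

The "Network Distance: 0 hops" line in the nmap output above is another hint that the scanned address belongs to the local host.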

Actions #7

Updated by tinita over 1 year ago

  • Subject changed from Use qa-power8 for ppc tests in o3 to Use qa-power8 for ppc tests in o3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by okurz over 1 year ago

I asked in #eng-testing as well as #help-it-ama

(Oliver Kurz) anyone being in or going to Nbg Maxtorhof SRV2 now/soon who could ensure that there is a network connection for QA-Power8 ? Or do you know of a better channel where to coordinate such work?

If somebody stops by in the meantime, or if there is no answer, one can still check that there is a cable connection.

Actions #9

Updated by okurz over 1 year ago

acarvajal found that there is a physical connection but he did not remember on which switch port the cable is connected.

Actions #10

Updated by okurz over 1 year ago

  • Tags deleted (next-office-day)

no, sorry. IIRC it was the switch at the top of the rack, the row of ports below and either the 5th or 6th port from the left when seeing it from the back

With that and IPMI SoL access that should be enough to make the machine usable again without needing physical access

Actions #11

Updated by mkittler over 1 year ago

  • Description updated (diff)

This comment is actually irrelevant, it is about the o3 worker (which #116078 is about).


With that

Ok, so the 3rd suggestion is not relevant after all.

and IPMI SoL access

I still cannot connect via IPMI:

ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.suse.de -U … -P … power status
Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session

Broken IPMI access is the big problem here. Powering the machine on and off and many more things I've already tried via the BMC didn't help.
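
For repeated attempts it can help to keep the ipmitool invocation in one place. The following is only a sketch: the helper name ipmi_cmd is made up, and the user name must be supplied by the caller. It uses the -E option so that ipmitool reads the password from the IPMI_PASSWORD environment variable instead of taking it on the command line.

```shell
#!/bin/sh
# Print the ipmitool command line used in this ticket for a given BMC host.
# -I lanplus and -C 3 are taken from the command above; -E makes ipmitool
# read the password from the IPMI_PASSWORD environment variable.
ipmi_cmd() {
    host=$1 user=$2
    shift 2
    printf 'ipmitool -I lanplus -C 3 -H %s -U %s -E %s\n' "$host" "$user" "$*"
}

# Usage (not executed here), with IPMI_PASSWORD exported beforehand:
#   eval "$(ipmi_cmd openqaworker-power8-ipmi.suse.de "$IPMI_USER" power status)"
#   eval "$(ipmi_cmd openqaworker-power8-ipmi.suse.de "$IPMI_USER" sol activate)"
```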

Actions #12

Updated by okurz over 1 year ago

We can also try to make use of the HMC control servers, e.g. "powerhmc1.arch.suse.de" or similar

Actions #13

Updated by okurz over 1 year ago

We added the machine to both powerhmc1.arch.suse.de and powerhmc3.arch.suse.de, and in both cases we can configure the "partitions", i.e. virtual machines, on the PowerVM managed machines. The mode can be switched between OPAL (for bare-metal installations) and PowerVM in the ASM, in this case https://qa-power8.qa.suse.de, in the menu "Firmware Configuration". For this to work no HMC connection may be configured. If necessary, remove the connection in the HMC and in the ASM menu "Hardware Management Consoles". The problem is that IPMI does not seem to be configured to allow remote access, so we would first need an OS installation on the host to call ipmitool locally and allow remote access. As an alternative we should try to use one of the VMs, or create a new VM, and try to enable remote access with ipmitool from there.

So the suggestion is: (Re-)install one of the LPAR VMs and try to call ipmitool to allow remote access.
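
What "call ipmitool to allow remote access" could mean in practice is sketched below. The channel number and user ID are assumptions that have to be checked with the print and list commands first, the helper name ipmi_enable_remote_plan is made up, and it is not confirmed in this ticket that these commands work from inside an LPAR. The function only prints the commands so they can be reviewed before running them as root.

```shell
#!/bin/sh
# Print the local ipmitool commands that typically enable LAN access.
# $1 = LAN channel number, $2 = user ID; both are assumptions to verify
# with "lan print" and "user list" before changing anything.
ipmi_enable_remote_plan() {
    channel=$1 userid=$2
    cat <<EOF
ipmitool lan print $channel
ipmitool lan set $channel access on
ipmitool user list $channel
ipmitool user enable $userid
ipmitool channel setaccess $channel $userid callin=on ipmi=on link=on privilege=4
EOF
}

# Example with commonly seen values (channel 1, user ID 2).
ipmi_enable_remote_plan 1 2
```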

Actions #14

Updated by okurz over 1 year ago

  • Copied to action #119755: Use a PowerVM machine to serve both PowerVM LPARs for testing as well as one VM running qemu tests added
Actions #15

Updated by mkittler over 1 year ago

What are the ASM credentials for https://qa-power8.qa.suse.de? I tried multiple combinations of standard ones we use elsewhere but nothing worked.

Actions #16

Updated by okurz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

mkittler wrote:

What are the ASM credentials for https://qa-power8.qa.suse.de? I tried multiple combinations of standard ones we use elsewhere but nothing worked.

provided in chat.

I am now trying myself. First, over https://powerhmc3.arch.suse.de/, I deleted "doc_vm" and "kdump" to free some resources, then created a new LPAR.

Following a "template" I created an LPAR named "aixlinux1" using a wizard and selected "Network boot" pointing to 10.162.0.1, qanet.qa.suse.de, using an IPv4 address 10.162.6.80 as from https://gitlab.suse.de/qa-sle/qanet-configs/-/blob/master/etc/dhcpd.conf#L439 assuming that this one is currently unused. The wizard eventually showed an error "REST0293 Error while trying to retrieve Network Adapters: lpar_netboot: The network boot ended in an error."

Then I asked around and got a hint in https://suse.slack.com/archives/C02CANHLANP/p1668603577713149?thread_ts=1668602055.990499&cid=C02CANHLANP from szarate to check instructions from
https://confluence.suse.com/display/qasle/Power8+and+Power9+%28PowerVM%29+on+SLES
From there I also learned that one can connect to an HMC over ssh. I could log in over ssh as okurz@powerhmc3.arch.suse.de but I could not add my ssh key. Anyway, I considered terminal access over this path more convenient, with just mkvterm -m qa-power8 -p aixlinux1 on powerhmc3.arch.suse.de. With that, szarate and I tried multiple approaches, reconfiguring the network devices and such, but did not see any DHCP or BOOTP requests on qanet.qa.suse.de. So maybe the ethernet ports are really not connected to a switch.
Also tried to hardcode network settings including gateway 10.162.63.254 based on the template we have seen on "redcurrant".

szarate and I came to the conclusion that the network is likely really not connected: not the BMC ports but the ethernet for the system. So this needs to be checked on "next-office-day" at Maxtorhof.
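
To watch for DHCP or BOOTP requests on the DHCP server while the LPAR attempts a network boot, a capture along these lines could be used. This is a sketch: the interface name and MAC address are placeholders, and the helper name dhcp_capture_cmd is made up. DHCP and BOOTP both use UDP ports 67 and 68, so one filter covers both.

```shell
#!/bin/sh
# Print a tcpdump command that shows DHCP/BOOTP traffic from one MAC address.
# $1 = capture interface, $2 = MAC address of the booting adapter (placeholders).
# -n disables name resolution, -e prints the link-level (MAC) header.
dhcp_capture_cmd() {
    iface=$1 mac=$2
    printf "tcpdump -n -e -i %s 'ether host %s and (port 67 or port 68)'\n" "$iface" "$mac"
}

# Example with hypothetical values.
dhcp_capture_cmd eth0 00:11:22:33:44:55
```

If nothing shows up while the LPAR is trying to boot, the request is not reaching the server, which points at cabling, VLAN or virtual network configuration rather than at the DHCP configuration.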

Actions #17

Updated by okurz over 1 year ago

  • Tags set to next-office-day
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
Actions #18

Updated by okurz over 1 year ago

  • Project changed from 46 to openQA Infrastructure
Actions #19

Updated by okurz over 1 year ago

  • Project changed from openQA Infrastructure to 46
  • Subject changed from Use qa-power8 for ppc tests in o3 size:M to [next-office-day] Use qa-power8 for ppc tests in o3 size:M
Actions #20

Updated by okurz over 1 year ago

  • Subject changed from [next-office-day] Use qa-power8 for ppc tests in o3 size:M to [next-office-day] Use qa-power8 for ppc tests in o3 - network connected? size:M
Actions #21

Updated by okurz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #22

Updated by okurz over 1 year ago

I checked the physical network connections. Updated racktables with entries for the two quad-ethernet access cards. There was already one connection to C10-1 (top) connector, gi30. Updated racktables accordingly. gi30 on qanet15nue.qa.suse.de is in VLAN12 so should be reachable in QA network. Connected a second connection C12-2 (second card on the right, second connector from top) to gi8, added to VLAN12 as well. And connected C10-4 (bottom) to gi2, VLAN12.

Actions #23

Updated by okurz over 1 year ago

  • Tags deleted (next-office-day)
  • Subject changed from [next-office-day] Use qa-power8 for ppc tests in o3 - network connected? size:M to Use qa-power8 for ppc tests in o3 - network connected? size:M
  • Due date set to 2022-12-02
  • Status changed from In Progress to Feedback

https://suse.slack.com/archives/C02CANHLANP/p1669026762157029?thread_ts=1668602055.990499&cid=C02CANHLANP

(Oliver Kurz) I have checked the machine QA-Power8 aka ps64mm1070 physically and I can confirm that it has ethernet connections. Moreover I connected two additional network cables now, see https://racktables.nue.suse.com/index.php?page=object&tab=ports&object_id=2350 . @Santiago Zarate @Zaoliang Luo can you help with setting up the network for LPARs?

Actions #24

Updated by okurz about 1 year ago

no response yet, asked again

Actions #25

Updated by okurz about 1 year ago

  • Status changed from Feedback to In Progress

I will look into setting up LPARs with network on qa-power8

Actions #26

Updated by okurz about 1 year ago

https://www.youtube.com/@nigelargriffiths could be helpful as a reference for "training".

In the meantime mgriessmeier and I have experimented and compared the setup on qa-power8 with other machines, e.g. "redcurrant" and "blackcurrant". In the "virtual network topology" views it looked like on other machines there is a proper network connection from the physical network all the way down to the LPARs; on qa-power8 there isn't.

We managed to install the VIOS instance. First we tried the internal network connection (192.168.X.X) and a VIOS version which failed to deploy. Then we selected the external network connection (10.X.X.X) and a VIOS version with "SP" in the name. This eventually got stuck. Then we connected over ssh to powerhmc1 and opened a text console to the VIOS instance. That showed a progress of 0% and an error message that no volume group could be found, with a waiting time of 105 minutes. We just tried to do something useful in the system, typed "q", "quit", "exit", and then suddenly the installation continued and progressed successfully till the end.

After that the VIOS instance had an "RMC" connection in the HMC. In the LPAR we could then select the "bridged" network and configure an entry in the dhcp config on qanet including PXE. With that we could boot over the network from a manual grub prompt, using the existing outdated PXE boot entries as a template, with linux+initrd specified as relative paths to what we could find on qanet. We verified that we could access the YaST installer over ssh but did not progress with the installation.

I suggest redoing all the above steps, which could even be done on power8.openqanet.opensuse.org, and documenting them again with all knowledge holes filled.
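
The "entry in the dhcp config on qanet" is not quoted in this ticket. Assuming qanet uses ISC dhcpd (the dhcpd.conf link in an earlier comment suggests so), a host entry could be generated as in the sketch below. The host name, MAC address and IP address are placeholders, the helper name dhcpd_host_entry is made up, and PXE-related options such as next-server or filename are left out because their values are not given here.

```shell
#!/bin/sh
# Print a minimal ISC dhcpd host stanza with a fixed address.
# $1 = host name, $2 = MAC address, $3 = IPv4 address (all placeholders).
dhcpd_host_entry() {
    name=$1 mac=$2 ip=$3
    printf 'host %s {\n  hardware ethernet %s;\n  fixed-address %s;\n}\n' "$name" "$mac" "$ip"
}

# Example with hypothetical values (192.0.2.0/24 is a documentation range).
dhcpd_host_entry qa-power8-lpar1 00:11:22:33:44:55 192.0.2.80
```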

Actions #27

Updated by okurz about 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
Actions #28

Updated by livdywan about 1 year ago

  • Due date deleted (2022-12-02)

No due date for unassigned tickets

Actions #29

Updated by okurz about 1 year ago

  • Project changed from 46 to openQA Infrastructure
Actions #30

Updated by okurz about 1 year ago

  • Related to action #122302: Support SD-105827 "PowerPC often fails to boot from network with 'error: time out opening'" added
Actions #31

Updated by livdywan about 1 year ago

Next steps:

  • Connect via SSH
  • Install an operating system
  • Make the machine usable as a worker

We could do that in our mob session on Thursday since this should be a good learning opportunity for those not familiar with the setup, or another ad-hoc slot where there's more people involved.

We'll probably need a follow-up after that to set up the network correctly once the machine is installed. See #119008 for reference.

Actions #32

Updated by mkittler about 1 year ago

I looked into this ticket again, so here a short summary:

  • I was able to login on https://qa-power8.qa.suse.de. Looks like the machine is configured to be in PowerVM mode (which should also explain why IPMI access is not working).
  • The system also appears to be already powered on. Not sure how to connect via SSH, though (and thus also not sure how to proceed with this ticket).
  • I have access to https://powerhmc3.arch.suse.de (not https://powerhmc1.arch.suse.de). It shows the host qa-power8 and I can even open the terminal for the test VM that's present. I'm still not sure how to proceed from there to install an OS. Maybe it helps to connect to the HMC via SSH. Is that what was meant by "Connect via SSH" in the previous comment?
  • I've read #119059#note-16. So this has already been attempted but the network didn't work. After #119059#note-23 and #119059#note-26 this shouldn't be a problem anymore.
  • I could theoretically try to follow-up on #119059#note-26 but the details (which are left to be documented) aren't clear to me. The comment also mentions hmc1 but I have only access to hmc3.

We could do that in our mob session on Thursday since this should be a good learning opportunity for those not familiar with the setup, or another ad-hoc slot where there's more people involved.

This would be a good idea. I feel a bit lost here.

Actions #33

Updated by okurz about 1 year ago

So first: It shouldn't technically matter if we use hmc1 or hmc3, but in the future hmc3 is to be used as we have full control over it and everybody has their own account. I recommend you connect to the serial terminal by connecting to the HMC over ssh and then calling the mkvterm command to connect to the terminal. Regarding "ssh" to the final system: what I did was start an installation with the Linuxrc parameter ssh=1 and then connect to the SUT, not the HMC.
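
The terminal access described above can be combined into a single command. This is a sketch using the account, HMC, system and LPAR names mentioned earlier in this ticket; the helper name hmc_console_cmd is made up, and ssh -t is used to force a terminal for the interactive console.

```shell
#!/bin/sh
# Print an ssh command that opens the virtual terminal of an LPAR via the HMC.
# $1 = HMC user, $2 = HMC host, $3 = managed system, $4 = partition name.
hmc_console_cmd() {
    user=$1 hmc=$2 system=$3 lpar=$4
    printf 'ssh -t %s@%s mkvterm -m %s -p %s\n' "$user" "$hmc" "$system" "$lpar"
}

# Example with the names used earlier in this ticket.
hmc_console_cmd okurz powerhmc3.arch.suse.de qa-power8 aixlinux1
```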

Actions #34

Updated by okurz about 1 year ago

Actions #35

Updated by okurz about 1 year ago

We took another go at this story and connected with ssh to powerhmc3.arch.suse.de, rebooted the machine in SMS and selected network boot, ending up in the grub menu served by qanet. We could continue from here, install something like a current Leap 15.4 and, similar to grenache-1, try to use nested kvm on that. But to be able to use the machine within o3 we also need an HMC there. Consider this out of scope for the current ticket, see #123712 for that.

I disconnected qa-power8 from HMC within https://powerhmc3.arch.suse.de/dashboard/#resources/systems so that we can switch to OPAL mode in the ASM and try to have IPMI and then continue with an installation of bare-metal Leap and then use as openQA kvm worker and then move to o3 network.

Actions #36

Updated by mkittler about 1 year ago

I disconnected qa-power8 from HMC within https://powerhmc3.arch.suse.de/dashboard/#resources/systems

Somehow that's not recognized by the ASM. It still says "Changing the firmware type is only allowed when the system is in powered off state and not managed by a management console." despite the system being powered off for quite a while now. That also means I could not enable IPMI so far. I hope I've been using the right ASM. (I have been using https://qa-power8.qa.suse.de as mentioned in the ticket description.)

Actions #37

Updated by okurz about 1 year ago

Well, then I suggest to reconnect with the HMC, ensure it's powered off and then disconnect again

Actions #38

Updated by mkittler about 1 year ago

I connected again via https://powerhmc3.arch.suse.de. Then I disconnected all connections to other HMCs that were still shown (there were 2 of them). Then I disconnected it from powerhmc3 itself. Unfortunately this didn't help. One can actually list the connected HMCs in the ASM and it says that the following are still connected: 10.161.26.36, 172.16.0.1
There was actually a 3rd one which I've removed. The remaining two aren't removable, though. Hence I'm not sure how to proceed. We somehow need to remove those (stale) entries.

Actions #39

Updated by livdywan about 1 year ago

The remaining two aren't removable, though. Hence I'm not sure how to proceed. We somehow need to remove those (stale) entries.

Let's check it in the infra call. Maybe Nick knows?

Actions #40

Updated by okurz about 1 year ago

https://powerhmc1.arch.suse.de/dashboard/ is currently "Service Unavailable". The ASM shows that there is still a connection to powerhmc1.arch.suse.de by IPv4 address. mkittler could remove some connections with a checkbox but not the one for hmc1. mkittler has tried a "soft-reset" of the service processor. We now tried the "reset", i.e. "hard reset". Right now https://qa-power8.qa.suse.de is not reachable anymore. Maybe after the reset and with hmc1 being broken the connection vanishes? Nope, the connection still shows up. But I could still log in over ssh to powerhmc1.arch.suse.de. I logged in using ssh hscroot@powerhmc1.arch.suse.de with credentials from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L1198
and ran

rmsysconn -m qa-power8 -o remove

After that, in the ASM https://qa-power8.qa.suse.de/ in the menu "Hardware Management Consoles", I could remove the one connection using the checkbox. But there is still a connection "V467f48*7044a59" with IPv4 address 172.16.0.1, which according to nsinger is the "PowerVM HMC network", but we don't know which HMC that is. My guess is that the address ending in .1 still refers to powerhmc1. Right now powerhmc1.arch.suse.de is rebooting. I suggest trying again at a later time to log in over ssh to powerhmc1 and, if that does not help, opening a ticket and asking in more channels.

Actions #41

Updated by livdywan about 1 year ago

I think this is actually this ticket, hence adding the bugref to https://openqa.suse.de/tests/10528653:

[2023-02-17T19:05:46.030419+01:00] [debug] [pid:575072] Could not connect to hscroot@powerhmc1.arch.suse.de, Retrying after some seconds...
[2023-02-17T19:05:56.033590+01:00] [info] [pid:575072] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  Error connecting to <hscroot@powerhmc1.arch.suse.de>: Connection timed out
Actions #42

Updated by livdywan 12 months ago

See also #124391 regarding powerhmc1.arch.suse.de being unavailable, and discussions in Slack

Actions #43

Updated by mkittler 12 months ago

  • Subject changed from Use qa-power8 for ppc tests in o3 - network connected? size:M to Use qa-power8 for ppc tests in o3 - network connected?

In the last mob session @okurz wanted to set this up as a bare-metal machine. We still struggle with removing one HMC connection, which prevents us from continuing with that approach.

Alternatively, we could continue with the HMC mode. This would likely mean setting up an HMC within the o3 network (or making HMC 3 somehow accessible).

Actions #44

Updated by mkittler 12 months ago

  • Description updated (diff)
Actions #45

Updated by nicksinger 12 months ago

I logged into powerhmc2 over ssh and ran "cat /proc/net/fib_trie" to confirm that hmc2 is indeed 172.16.0.1. Afterwards I could run another "rmsysconn -m qa-power8 -o remove". The option to change the firmware mode was still not enabled, but the "Hardware Management Consoles" page, where previous connections were displayed, now showed that no HMCs are connected anymore and gave a quick guide on how to "Reset the server to a non-HMC managed configuration":

  1. Go to "System Service Aids", select "Factory Configuration" and use the option "Reset server firmware settings".
  2. Go back to "Hardware Management Consoles", which now shows a button "Reset the server to a non-HMC managed configuration", and click it. The ASM confirms with "Operation was successful. This Server is no longer managed by an HMC".
  3. Now the "Firmware Configuration" can be changed from "PowerVM" to "OPAL".
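
Reading /proc/net/fib_trie by eye is tedious. The check whether an HMC owns a given address can be sketched as below; it relies on the usual fib_trie layout in which a "/32 host LOCAL" line follows the line naming the address, and the helper name local_ipv4_from_fib_trie is made up for illustration.

```shell
#!/bin/sh
# Print the IPv4 addresses that the kernel considers local, one per line.
# Reads the content of /proc/net/fib_trie from stdin. For every
# "/32 host LOCAL" line the address from the preceding line is printed.
local_ipv4_from_fib_trie() {
    awk '/32 host/ { print prev } { prev = $2 }' | sort -u
}

# Usage (not executed here), with $HMC set to the HMC to inspect:
#   ssh hscroot@"$HMC" cat /proc/net/fib_trie | local_ipv4_from_fib_trie | grep -Fx 172.16.0.1
```

If the address shows up, that HMC is the one holding the stale connection, and rmsysconn -m qa-power8 -o remove can be run there as described above.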

Actions #46

Updated by mkittler 12 months ago

Looks like you have also already changed the config to "OPAL" and set the console type to IPMI. Unfortunately I still wasn't able to connect via IPMI, e.g. ipmitool -I lanplus -C 3 -H qa-power8.qa.suse.de -U … -P … power status didn't work. Looks like this host is in the same state as the other o3 worker power8. I've also tried to set the IPMI password again (to be identical to the ASM password) but it didn't work.

Actions #47

Updated by okurz 12 months ago

  • Copied to action #125216: Use qa-power8 for ppc tests in o3 - try one of the suggestions size:M added
Actions #48

Updated by okurz 12 months ago

  • Tags changed from infra, mob to infra
  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

With #119059#note-45 we now know how to set a machine to OPAL even if it refuses.

We see some options, updated the description and created two subtasks.

Actions #49

Updated by openqa_review 11 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_textmode_hmc_ntlm
https://openqa.suse.de/tests/10695007#step/bootloader/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #50

Updated by okurz 11 months ago

blocked by #125216

Actions #51

Updated by livdywan 10 months ago

okurz wrote:

blocked by #125216

This was unblocked more recently, and I'm getting started on the evaluation. Given the priority we might also reconsider, e.g. pick the option that is easiest to set up and maintain, depending on how strongly stakeholders feel about the options. Just a thought I would consider anyway, but let's see what we find.

Actions #52

Updated by okurz 9 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Priority changed from High to Low
  • Target version changed from Ready to future
Actions #53

Updated by okurz 9 months ago

  • Parent task changed from #109743 to #129499