action #116078

closed

coordination #118642: [epic] PPC testing capabilities in openqa.opensuse.org

Recover o3 worker kerosene formerly known as power8, restore IPMI access size:M

Added by mkittler over 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2022-08-31
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

After a firmware update the IPMI access broke. The machine booted once successfully but now it is stuck on boot and we need IPMI access to recover it.

Acceptance criteria

  • AC1: DONE IPMI access has been restored
  • AC2: ppc64le qemu openQA jobs on openqa.opensuse.org are executed again as expected

Context

Suggestions

  • DONE Follow suggestions from https://bugzilla.opensuse.org/show_bug.cgi?id=1202275 and https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html
  • DONE Write an email, raise an alarm, state that we don't have any ppc tests running for openSUSE right now AT ALL!!!1 (factory first!!!) We need help from someone who can set up PowerPC machines and at best has direct access to SUSE Nbg Maxtorhof SRV1. If there is no one, then escalate and question whether SUSE can support this at all in the future
  • DONE As alternative coordinate access to SRV1 for ourselves and try to fix it in place -> the machine is in FC Basement meanwhile with easier access
  • Boot via recovery media over PXE.
  • DONE Attach a "Virtual CD" which can (IIRC) be attached over the HMC.

Related issues 3 (0 open, 3 closed)

  • Related to openQA Infrastructure - action #115094: [tools] test fails in bootloader_start: redcurrant is down size:M (Resolved, nicksinger, 2022-08-08)
  • Related to openQA Infrastructure - action #114565: recover qa-power8-4+qa-power8-5 size:M (Resolved, okurz, 2022-12-19)
  • Related to openQA Infrastructure - action #138038: diesel+petrol missing network, IPMI still reachable (Resolved, okurz, 2023-10-16)

Actions #1

Updated by livdywan over 1 year ago

  • Project changed from openQA Infrastructure to 175
  • Description updated (diff)
Actions #2

Updated by livdywan over 1 year ago

  • Subject changed from Recover o3 worker power8, restore IPMI access to Recover o3 worker power8, restore IPMI access size:undecided

We agreed to not estimate it yet but raise it in the sync call first. Currently we can't work on it anyway due to the AC situation

Actions #3

Updated by jstehlik over 1 year ago

Take a look at the unused Power8 machines in server room 2, probably rack 1. Matthias will add more details

Actions #4

Updated by jstehlik over 1 year ago

  • Related to action #115094: [tools] test fails in bootloader_start: redcurrant is down size:M added
Actions #6

Updated by livdywan over 1 year ago

  • Due date set to 2022-10-28
  • Status changed from New to Blocked
  • Assignee set to livdywan

Blocking on availability of the HMC

Actions #7

Updated by okurz over 1 year ago

  • Project changed from 175 to openQA Infrastructure
  • Subject changed from Recover o3 worker power8, restore IPMI access size:undecided to Recover o3 worker power8, restore IPMI access
  • Status changed from Blocked to Feedback
  • Priority changed from Normal to High

The ticket was set to "Blocked" but https://sd.suse.com/servicedesk/customer/portal/1/SD-97044 is set to "Closed". As discussed in weekly QE sync with jstehlik and also cdywan we should add it back to the "openQA Infrastructure" ticket project and also not set it as "Blocked" because work from our side is needed.

Actions #8

Updated by livdywan over 1 year ago

As discussed I sent an email to research@suse.de to find out what options we have here.

Actions #9

Updated by okurz over 1 year ago

  • Subject changed from Recover o3 worker power8, restore IPMI access to Recover o3 worker power8, restore IPMI access size:M
  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html is the email and there is already a response. We estimated the ticket in the weekly QE Tools estimation call 2022-09-15

Actions #10

Updated by mkittler over 1 year ago

  • Due date changed from 2022-10-28 to 2022-11-02
  • Status changed from Workable to Feedback
  • Assignee set to nicksinger
Actions #11

Updated by nicksinger over 1 year ago

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

Actions #12

Updated by livdywan over 1 year ago

nicksinger wrote:

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

Hector is suggesting we open a bugzilla with an explanation, versions used and references to issues to be mirrored on the IBM side

Actions #13

Updated by okurz over 1 year ago

  • Parent task set to #118642
Actions #14

Updated by nicksinger over 1 year ago

  • Status changed from Feedback to Blocked

cdywan wrote:

nicksinger wrote:

I've asked Hector for help to reach out to somebody from IBM: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249

Hector is suggesting we open a bugzilla with an explanation, versions used and references to issues to be mirrored on the IBM side

Not sure what the process is but I created https://bugzilla.opensuse.org/show_bug.cgi?id=1204277 now. Hope this makes its way to IBM somehow.
In general this now raised an upcoming discussion on how we want to proceed with power testing for openSUSE (see https://suse.slack.com/archives/C02CANHLANP/p1665658811414529?thread_ts=1664464326.700249&cid=C02CANHLANP)

Actions #15

Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress

We logged in over https://powerhmc1.arch.suse.de and clicked on "Connect Systems…" and added the machine openqaworker-power8-ipmi.suse.de with the credentials as in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L55 . The system appeared with an entry for IPv4 address 10.160.1.104 but the message "Version mismatch". Details: "This version of the management console is unable to manage the system. Update the management console Licensed Internal Code. For information on updating Licensed Internal Code, see 'Update HMC' in the Operations Guide for the HMC and Managed Systems, SA76-0085.". nicksinger suggests to use powerhmc3 for which we do not have credentials.

So @nicksinger please ask aeisner for credentials or an update of powerhmc1.

Actions #16

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Blocked

Ticket for a user in powerhmc3 created: https://sd.suse.com/servicedesk/customer/portal/1/SD-101953

Actions #17

Updated by okurz over 1 year ago

  • Priority changed from High to Urgent

Becoming urgent now with external users asking for this and more time passing without any ppc64le testing going on. I extended the request in https://sd.suse.com/servicedesk/customer/portal/1/SD-101953

Actions #18

Updated by okurz over 1 year ago

https://sd.suse.com/servicedesk/customer/portal/1/SD-101953 received an update. According to aeisner we have credentials on the system for all current team members. I tried to connect systems qa-power8, qa-power8-3 and the o3 machine power8-ipmi but I can only see two machines "haldir" and "legolas". I assume that we are members of some "QA-SAP" group which would not make sense. I put a comment into the ticket and am waiting for aeisner's response.

Actions #19

Updated by okurz over 1 year ago

  • Due date changed from 2022-11-02 to 2022-11-11
  • Assignee changed from nicksinger to okurz
Actions #20

Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to nicksinger

As discussed with aeisner apparently it’s not possible to upgrade the HMC nor the internal licenses. We agreed that with the team SUSE QE Tools having access to https://powerhmc3.arch.suse.de/ we can put all machines with newer firmware into there. Now we can try to get power8.openqanet.opensuse.org into one of the HMCs.

Actions #21

Updated by okurz over 1 year ago

  • Assignee changed from nicksinger to okurz
Actions #22

Updated by okurz over 1 year ago

I have added power8.openqanet.opensuse.org as well as qa-power8-3.qa.suse.de in both https://powerhmc1.arch.suse.de/ and https://powerhmc3.arch.suse.de/ . In both cases the machine shows up under the IPv4 address 10.160.1.104 with the error message from #116078#note-15 . Maybe the firmware update as explained in https://bugzilla.opensuse.org/show_bug.cgi?id=1202275 was actually newer than the HMC install date of hmc3, installed 2021-12. So, can we update hmc3?

First we tried to change from OPAL mode to PowerVM mode in ASM. Then there was a message about needing a system password so we set a "new" system password, the one from salt pillars. Then we called the action to reset the system connection. Now the machine was correctly found in hmc3. We powered on the system. Also in hmc1 we reset the connection and the system showed up fine. But we could not find where to configure IPMI or anything similar. We could connect to the console of the "system partition" over HMC and we were in an "SMS menu" where one could select boot entries and such, but there we did not find any storage volume, possibly because that would need to be a virtualized storage container anyway.

So we opted to remove the machine from both HMCs again but ASM still prevented us from changing to OPAL mode. First in ASM one needs to go to "Hardware Management Consoles" to reset to a non-HMC managed instance and then change the firmware mode back to OPAL mode. Still can't connect over IPMI.

Actions #23

Updated by okurz over 1 year ago

  • Due date deleted (2022-11-11)
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

The system does not boot up. Please create a ticket to ask Eng-Infra to connect a monitor and keyboard to the machine and check why it does not boot. Then, as soon as someone can log in, configure IPMI access with a local "ipmitool" invocation.

Actions #24

Updated by nicksinger over 1 year ago

  • Assignee set to nicksinger
Actions #25

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Blocked

Ticket created https://sd.suse.com/servicedesk/customer/portal/1/SD-103442 and waiting for answers now. Most likely we need to be there in person with somebody from eng-infra but let's see how it goes.

Actions #26

Updated by okurz over 1 year ago

  • Status changed from Blocked to In Progress

From the ticket:

As we chatted in slack, this weird powermachine has no option (display port vga etc) to direct connect it. sorry for not find a solution so far

I suggest to ask around to more people what we try to achieve, what has been done, what failed and ask what we should do then

Actions #27

Updated by openqa_review over 1 year ago

  • Due date set to 2022-11-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #28

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)

Quick recap of the last days: physical installation/recovery cannot be conducted because the machine lacks any graphical output port.
We also came up with the idea to recover/configure the IPMI over an LPAR while the machine is managed by the HMC. I connected the machine to powerhmc3 but was not able to boot any recovery media over PXE. The only idea left is to try the same with a "Virtual CD" which can (IIRC) be attached over the HMC.

Actions #29

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #30

Updated by mkittler over 1 year ago

I suggest to ask around to more people what we try to achieve, what has been done, what failed and ask what we should do then

We've already started a mailing list thread a while ago: https://mailman.suse.de/mlarch/SuSE/research/2022/research.2022.09/msg00047.html

The conclusion was to ask Hector: https://suse.slack.com/archives/C02CANHLANP/p1664464326700249 - This discussion switched to Power9 at some point. Hector eventually forwarded some information to IBM but it was apparently not of any help. As far as I can see, we haven't gained any new information on how to restore IPMI access of the machine.

Actions #31

Updated by okurz over 1 year ago

Regarding PowerPC management, everyone should be able to login on https://powerhmc3.arch.suse.de/

Actions #32

Updated by okurz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

As discussed in the weekly QE sync meeting I will try to escalate the issue and bring in more help. If there is nobody available at the company providing reasonable guidance then we for sure have a problem :D

Actions #33

Updated by okurz over 1 year ago

  • Due date deleted (2022-11-25)
  • Status changed from In Progress to Blocked

A draft to a message in #engineering:

Hi, we have repeatedly brought up the topic but I am trying again as we could not find a solution: We have PowerPC machines, Power8 in this example, …

first trying with #119059

Actions #35

Updated by okurz over 1 year ago

Helpful pointers from Michal Suchanek:

(Michal Suchanek) doesn't openQA use network boot on Power machines already?
This is an article that shows how to use low-level VIOS commands to create and manage virtual optical media and drives
https://www.ibm.com/support/pages/how-configure-vios-media-repositoryvirtual-media-library-ex-aix-installrestore
It's also possible to configure on HMC https://www.ibm.com/support/pages/creating-virtual-optical-drive-hmc-managed-vios-partition
If you have bare metal machines that cannot be connected to HMC this (likely) applies https://www.ibm.com/docs/en/linux-on-systems?topic=systems-petitboot-bootloader
System can be booted directly from a network server, https://download.suse.de/install/SLP/ should do.
In OPAL mode the last one applies. You either have PowerVM LPARS or OPAL, not both
(Oliver Kurz) Can I see somewhere if the physical machine has a physical network connection besides the one used for management?
(Michal Suchanek) On a HMC-connected machine you would see it on HMC. All machines have some sort of hardware inventory available on the web management interface but the interface varies depending on product line.
(Oliver Kurz) would you like to check the machine "qa-power8" on https://powerhmc3.arch.suse.de/dashboard/#resources/systems yourself? I failed to find
(Michal Suchanek) The petitboot bootloader is Linux so if you boot into that it should have an option to enter shell, too
(Oliver Kurz) Yes, I used that to kexec, no problem if I would be able to reach that state. But for the machine "qa-power8" I am not even sure if network is connected. And for a second machine, power8.openqanet.opensuse.org, according to https://www.ibm.com/docs/en/linux-on-systems?topic=qsglpsls-installing-ubuntu-power-system-lc-servers-using-network-boot we would need a monitor+keyboard at least for the initial setup after we lost remote IPMI after a firmware upgrade, but according to Nick Singer in https://progress.opensuse.org/issues/116078#note-28 the machine does not have any graphical output port
(Michal Suchanek) Isn't there an option to use a physical serial port?
(Oliver Kurz) I would need to ask someone to get into that server room Nbg SRV1 and check that
(Michal Suchanek) If you are unsure that network is connected you need to ask people who connect the network. In petitboot you should be able to configure network card and see what MAC address it has.
(Michal Suchanek) the option to use physical serial port would have to be configurable in the web management [ASM]

Turned out to be available over "System Service Aids", "Serial Port Setup" but only if the machine is not managed by an HMC

Actions #37

Updated by okurz over 1 year ago

  • Tags set to next-office-day

Potentially we can move power8 out of NUE1 SRV1 and to Frankencampus where it might be easier to handle with non-restricted physical access

Actions #38

Updated by okurz over 1 year ago

  • Status changed from Blocked to Feedback
  • Assignee changed from okurz to nicksinger
  • Priority changed from Urgent to High

@nicksinger you wanted to wait for mmaher/gschlotter to tell you when they go to SRV1 and try with a thumb drive and serial port wasn't it?
yes, it will be our only way to access the machine currently

Actions #39

Updated by okurz over 1 year ago

still the current plan to find an appointment when to do that.

Actions #40

Updated by nicksinger over 1 year ago

Gerhard is back from vacation and we will visit SRV1 tomorrow (2023-01-10) to remove the machine from the rack so I can mount it in the TAM lab for further debugging/testing

Actions #41

Updated by nicksinger over 1 year ago

Machine is now mounted above merckx (racktables entries pending). USB2Serial connected to seth (ttyUSB0) and ASM reachable via https://10.162.29.184/cgi-bin/cgi. Network cables connected to qanet12:
port27 == HMC1
port3 == C10-1

Actions #42

Updated by okurz over 1 year ago

nicksinger wrote:

Machine is now mounted above merckx (racktables entries pending). USB2Serial connected to seth (ttyUSB0) and ASM reachable via https://10.162.29.184/cgi-bin/cgi. Network cables connected to qanet12:
port27 == HMC1
port3 == C10-1

Updated racktables but one point is not clear: Can you help me to relate C10-1 to one of the names on https://racktables.nue.suse.com/index.php?page=object&tab=ports&object_id=2012 please?

I now connected enP3p9s0f0 to qanet12-port3, the same interface as was used before. Is this correct?

Actions #43

Updated by okurz over 1 year ago

@nicksinger when you happen to go to the lab you can take a look for the labels of the ports and crosscheck with https://racktables.nue.suse.com/index.php?page=object&tab=ports&object_id=2012 and then just delete the present connection, add labels and not care that much about the Linux internal interface names.

Actions #44

Updated by mkittler over 1 year ago

  • Blocks action #118024: Ensure all PPC workers are upgraded after kernel regression resolved size:M added
Actions #45

Updated by mkittler over 1 year ago

  • Blocks deleted (action #118024: Ensure all PPC workers are upgraded after kernel regression resolved size:M)
Actions #46

Updated by okurz over 1 year ago

  • Status changed from Feedback to Blocked
  • Assignee changed from nicksinger to okurz

Blocked by #119551

Actions #47

Updated by okurz over 1 year ago

  • Priority changed from High to Normal

But also lower prio as we have more ppc resources now with qa-power8-3 and such

Actions #48

Updated by okurz about 1 year ago

  • Target version changed from Ready to future

something for later, tracking outside backlog.

Actions #49

Updated by okurz 12 months ago

  • Tags changed from next-office-day, infra to infra
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

machine is now in FC Basement, racktables entry: https://racktables.nue.suse.com/index.php?page=object&object_id=2012

Network and power connections:

  • HMC1 HMC1 1000Base-T 98:BE:94:4B:8E:98 JEX34-FC-B-1 ge-0/0/18 yellow
  • pwr0 AC-in PDU-FC-B1 14-1
  • pwr1 AC-in PDU-FC-B1 14-2
  • enP1s9f2 1000Base-T 40:F2:E9:5D:36:A6 JEX34-FC-B-1 ge-0/0/15

so should be remote controllable

Actions #50

Updated by okurz 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to Urgent
  • Target version changed from future to Ready

There was an update in https://bugzilla.suse.com/show_bug.cgi?id=1204277#c17

I have a 8247-22L machine here running SV860_243 and I connected with celogin (not celogin1) and I have "ASMI Systems Configuration ->security -> External Services management" .
So let's try that :

  • make sure you run SV860_243 (on the top right ASMi web page)
  • then try connecting with celogin and the corresponding temporary password depending on the day you try that: 07/05/2023 F6148F4D, 07/06/2023 59492173, 07/07/2023 F93A1770

Currently we only have a dynamic DHCP lease but the ASM is reachable over https://10.168.193.46/

As stated in https://bugzilla.suse.com/show_bug.cgi?id=1204277#c9 the machine has FW840.00 (SV840_043), not SV860_243. I tried to login with "celogin" and the temporary password of today F6148F4D but couldn't login. I logged in with our admin account and found:

User ID     Status
dev         Disabled
celogin     Disabled
celogin1    Enabled
celogin2    Disabled

I now logged in with celogin1/celogin1 but could not find any additional menu entry that nicksinger mentioned in https://bugzilla.suse.com/show_bug.cgi?id=1204277#c12 . This is time critical as we only have temporary passwords for today, tomorrow and the day after tomorrow

Actions #51

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #52

Updated by okurz 10 months ago

nicksinger logged in as "admin" into the ASM and then in "Login Profile -> User Policy" he could enable the "celogin" account. Then we could login into the ASM with the "celogin" account and the temporary password of the day and see the option "System Service Aids -> Service Processor Command Line". nicksinger tried "iptables -F" to prevent any network blocks but we saw no effect of that, i.e. "insufficient user privileges" or something like that. But we are still running the old firmware version 840. In https://bugzilla.opensuse.org/show_bug.cgi?id=1202275#c20 mkittler mentioned

New T Image:     SV860_243
New P Image:     SV840_043

so it looks like the firmware was upgraded but only for "T". According to "Power/Restart Control -> Power On/Off System" one can select "Temporary" or "Permanent" firmware for the next boot. We selected "Temporary", saved the settings and reset the service processor using "System Service Aids -> Reset Service Processor". https://10.168.193.46/ now shows "https://10.168.193.46/cgi-bin/cgi". Now "System Configuration" shows "Security". In there, in "External Services Management", we found a pull-down menu for IPMI which was set to "Disabled"; we set it to "Enabled" and saved. We still could not connect over IPMI. Over "Login Profile -> Change Password" we then changed the password for user ID "IPMI", e.g. to "admin", and tried to authenticate over ipmitool with "ADMIN" or "admin" but did not succeed. It looks like the server is responding, we just can't authenticate.

EDIT: nicksinger succeeded in finding out what we need to do. If we just specify a password and no username on ipmitool then we can connect, e.g. ipmitool -C 3 -I lanplus -H 10.168.193.46 -P XXX sol activate. We set the default openQA worker password from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls now. We could see the system booting up and reaching petitboot. The system does not try to boot any option by default and our PXE boot setup on qa-jump.qe.nue2.suse.org does not currently support ppc. Also we did not find any permanent storage like hdd drives. However what we can do temporarily is exit to the petitboot shell and use kexec to load a usable system so what we did for now:

cd /tmp
wget http://download.opensuse.org/ports/ppc/distribution/leap/15.4/repo/oss/boot/ppc64le/linux
wget http://download.opensuse.org/ports/ppc/distribution/leap/15.4/repo/oss/boot/ppc64le/initrd
kexec -l linux --initrd initrd --command-line 'install=http://download.opensuse.org/ports/ppc/distribution/leap/15.4/repo/oss/ sshd=1 sshpassword=XXX rootpassword=XXX console=tty0 console=hvc0 rescue=1'

The rescue system booted and we could login. Surprisingly also there no storage devices could be found.
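The password-only IPMI login found above can be sketched as a small helper. This is only a sketch under assumptions: the function name `sol_cmd` is hypothetical, and instead of `-P XXX` it uses ipmitool's `-E` option, which reads the password from the `IPMI_PASSWORD` environment variable so that it does not end up in the shell history. The helper only prints the command and does not run it.

```shell
# Hypothetical helper: print the ipmitool command for a Serial-over-LAN
# session that authenticates with a password only (no -U username).
# -E makes ipmitool read the password from $IPMI_PASSWORD.
sol_cmd() {
    host=$1   # service processor address, e.g. 10.168.193.46
    printf 'ipmitool -I lanplus -C 3 -H %s -E sol activate\n' "$host"
}
```

For example, `sol_cmd 10.168.193.46` prints the command line for the ASM address mentioned in this comment; export `IPMI_PASSWORD` before running the printed command.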

Actions #53

Updated by okurz 10 months ago

  • Tags changed from infra to infra, next-frankencampus-visit

More conversation in https://bugzilla.suse.com/show_bug.cgi?id=1204277 regarding the missing storage devices. The current hypothesis is that during the last physical move of the machine the storage controller was actually unseated. I plan to check the machine on "next-frankencampus-visit"

Actions #54

Updated by okurz 10 months ago

  • Tags changed from infra, next-frankencampus-visit to infra
  • Due date deleted (2023-07-20)
  • Status changed from In Progress to Blocked

I physically removed the machine from the rack, opened the case and checked proper seating of all extension cards and components. I removed the SAS controller and cables and re-attached everything and put the machine back into the rack. But the storage controller does not show up in the system. I added a comment in https://bugzilla.suse.com/show_bug.cgi?id=1204277#c24

Actions #55

Updated by okurz 10 months ago

  • Tags changed from infra to infra, next-frankencampus-visit
  • Status changed from Blocked to Workable
  • Priority changed from Urgent to High

I will need to check the physical machine the next time I am in FC Basement and check whether a possible replacement controller (see email in my "todo" folder) would be compatible.

Actions #56

Updated by okurz 10 months ago

  • Tags changed from infra, next-frankencampus-visit to infra
  • Status changed from Workable to Feedback

Did that and answered Frederic regarding the adapter. Waiting for delivery

Actions #57

Updated by okurz 9 months ago

  • Priority changed from High to Low
  • Target version changed from Ready to future

tracking that outside the backlog. Either we receive replacement hardware or not.

Actions #58

Updated by okurz 7 months ago

  • Status changed from Feedback to Workable
  • Target version changed from future to Ready

https://suse.slack.com/archives/C02CANHLANP/p1695308268615669?thread_ts=1668781543.104029&cid=C02CANHLANP
Adapter has arrived in FC. I will pick it up next Monday and see what I can do

Actions #59

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #60

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress

Replacement obtained. Built in. System reconnected. Over SoL I could see in petitboot that the storage devices are found and the linux partition on it. The system then fails to boot, ends up in emergency mode, due to NFS from o3 (obviously) not mountable. Also there is likely no proper network connection configured. Also we don't have a static lease yet for the system ethernet configured. But now I decided to rename the system as "openqaworker-power8" is both ambiguous as well as long. So in line with "petrol" and "diesel" in https://racktables.nue.suse.com/index.php?page=rack&rack_id=19174 in the same rack, also Power8, I came up with "kerosene". According changes:

Actions #61

Updated by okurz 7 months ago

  • Status changed from In Progress to Blocked
Actions #62

Updated by okurz 7 months ago

  • Subject changed from Recover o3 worker power8, restore IPMI access size:M to Recover o3 worker kerosene formerly known as power8, restore IPMI access size:M
Actions #63

Updated by okurz 7 months ago

  • Status changed from Blocked to Workable

both merged. Next steps can be picked up.

Actions #64

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #65

Updated by okurz 7 months ago

  • Tags changed from infra to infra, next-frankencampus-visit
  • Status changed from Workable to In Progress

I could now connect using ipmitool -I lanplus -C 3 -H kerosene-sp.qe.nue2.suse.org -P …. It actually also works without -C 3, so ipmitool -I lanplus -H kerosene-sp.qe.nue2.suse.org -P … is enough. The system is still stuck in emergency mode. ip link shows no network link, and neither does dmesg | grep -i link. Triggered a reboot. If the network again does not come up then I should crosscheck the network connection locally.

petitboot mentions

  [Network: eth7 / 40:f2:e9:5d:3d:3f]
    netboot eth7 (pxelinux.0)

and on walter1 I also saw

Sep 28 13:46:18 walter1 dhcpd[30520]: DHCPDISCOVER from 40:f2:e9:5d:3d:3f via eth0
Sep 28 13:46:19 walter1 dhcpd[30520]: DHCPOFFER on 10.168.195.95 to 40:f2:e9:5d:3d:3f via eth0
Sep 28 13:46:19 walter1 dhcpd[30520]: DHCPREQUEST for 10.168.195.95 (10.168.192.1) from 40:f2:e9:5d:3d:3f via eth0
Sep 28 13:46:19 walter1 dhcpd[30520]: DHCPACK on 10.168.195.95 to 40:f2:e9:5d:3d:3f via eth0

which according to racktables is enP3p9s0f3, the last interface there. Maybe I need to simply configure all interfaces. In emergency mode in /etc/sysconfig/network did ln -s ifcfg-enP3p9s0f0 ifcfg-enP3p9s0f3
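To find out which MAC address actually received a lease, the dhcpd log lines above can be filtered for DHCPACK entries. The following is a rough sketch assuming the log format shown in this comment; the function name `dhcp_acks` is hypothetical.

```shell
# Hypothetical helper: read dhcpd log lines on stdin and print
# "<MAC> <IP>" for every DHCPACK entry.
# Expected line shape: "... DHCPACK on <IP> to <MAC> via <iface>"
dhcp_acks() {
    awk '/DHCPACK on/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "DHCPACK") {
                print $(i + 4), $(i + 2)   # MAC first, then leased IP
                break
            }
        }
    }'
}
```

Fed with the four log lines above, this prints a single line pairing 40:f2:e9:5d:3d:3f with 10.168.195.95, which can then be compared against the MAC addresses recorded in racktables.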

Actions #66

Updated by okurz 7 months ago

  • Status changed from In Progress to Workable

Right now I can't even reach kerosene-sp.qe.nue2.suse.org. Maybe the weekend helps here :)

Actions #67

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #68

Updated by okurz 7 months ago

System reachable again. I made a mistake in the reserved IPv4 address, fixed in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4124 . Looked into the emergency mode with nicksinger and we found that the partition /dev/sda1 used for /var/lib/openqa is corrupt and needs a manually triggered fsck.ext4 -y /dev/sda1. After that we could call mount -a successfully. We also made the fsck for /var/lib/openqa non-critical so that the system continues booting on problems, which is more convenient. We saw that there is incomplete console output on boot so in /etc/default/grub we added console=hvc0 to the boot line. Also the network config for the currently connected "eth7" was missing so we did cd /etc/sysconfig/network && ln -s ifcfg-{enP1p9s0f1,eth7}
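The brace expansion in that last command is a bash shorthand for `ln -s ifcfg-enP1p9s0f1 ifcfg-eth7`. A minimal sketch of the same step without brace expansion follows; the function name `link_ifcfg` and its argument order are hypothetical, and the directory is passed in so it can be tried outside /etc/sysconfig/network first.

```shell
# Hypothetical helper: reuse an existing ifcfg file for another
# interface name by creating a relative symlink next to it.
link_ifcfg() {
    netdir=$1     # e.g. /etc/sysconfig/network
    existing=$2   # interface that already has a config file
    wanted=$3     # interface name that is actually connected
    if [ ! -e "$netdir/ifcfg-$existing" ]; then
        echo "no ifcfg-$existing in $netdir" >&2
        return 1
    fi
    # The link target is relative, so it resolves inside $netdir.
    ln -s "ifcfg-$existing" "$netdir/ifcfg-$wanted"
}
```

Calling `link_ifcfg /etc/sysconfig/network enP1p9s0f1 eth7` would correspond to the command used on the machine.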

Actions #69

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress
Actions #70

Updated by okurz 7 months ago

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4124 to add a DHCP/DNS entry for the currently actually connected interface

In /etc/openqa/workers.ini we changed HOST to https://openqa.opensuse.org and then added a section

[https://openqa.opensuse.org]
TESTPOOLSERVER = -e 'ssh -p 2214 -i /var/lib/openqa/ssh/id_ed25519 -o "UserKnownHostsFile /var/lib/openqa/ssh/known_hosts"' geekotest@gate.opensuse.org:/var/lib/openqa/tests

created a corresponding ssh key which we allowed on o3. Now jobs are picked up and worked on but dependencies are missing, e.g. SemVer. Calling zypper dup.

Actions #71

Updated by okurz 7 months ago

Apparently there is no openSUSE Leap 15.5 ppc, reported https://bugzilla.opensuse.org/show_bug.cgi?id=1215874 . We can consider moving to Tumbleweed or SLE. I would try Tumbleweed. But for now let's stay on Leap 15.4. Regarding the missing dependencies I installed os-autoinst-distri-opensuse-deps. Now zypper dup wants to delete os-autoinst and openQA and such. I don't know why. zypper in --force openQA-auto-update openQA-client openQA-common openQA-continuous-update openQA-worker os-autoinst os-autoinst-swtpm perl-DBD-Pg perl-Mojo-Pg fixed it. Restarted first worker instance. Monitoring. https://openqa.opensuse.org/tests/3608908# looks good for now.

Actions #72

Updated by okurz 7 months ago

  • Status changed from In Progress to Blocked
Actions #73

Updated by okurz 7 months ago

  • Tags changed from infra, next-frankencampus-visit to infra
Actions #74

Updated by okurz 7 months ago

  • Status changed from Blocked to Workable

okurz wrote in #note-71:

Apparently there is no openSUSE Leap 15.5 ppc, reported https://bugzilla.opensuse.org/show_bug.cgi?id=1215874

Repo paths have just moved
https://bugzilla.suse.com/show_bug.cgi?id=1215874#c1

For the main repo I should be able to use https://download.opensuse.org/distribution/leap/15.5/repo/oss/ppc64le/

Actions #75

Updated by okurz 7 months ago

in /etc/zypp/repos.d/repo-oss.repo replaced the URL

-baseurl=http://download.opensuse.org/ports/ppc/distribution/leap/$releasever/repo/oss/
+baseurl=http://download.opensuse.org/distribution/leap/$releasever/repo/oss/ppc64le/
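
The URL replacement shown in the diff above could be scripted roughly as follows. This is a sketch only: the function name `rewrite_leap_ppc_baseurl` is hypothetical, it touches only `baseurl=` lines that end in the old ports path, and it keeps the literal `$releasever` placeholder untouched.

```shell
# Hypothetical helper: switch a zypper repo file from the old
# ports/ppc path to the new ppc64le path, as in the diff above.
rewrite_leap_ppc_baseurl() {
    repo_file=$1   # e.g. /etc/zypp/repos.d/repo-oss.repo
    tmp=$(mktemp) || return 1
    # \$ matches the literal "$" of $releasever; the final $ anchors
    # the match at the end of the line.
    sed '/^baseurl=/s|/ports/ppc/distribution/leap/\$releasever/repo/oss/$|/distribution/leap/$releasever/repo/oss/ppc64le/|' \
        "$repo_file" > "$tmp" && cat "$tmp" > "$repo_file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Running it on a copy of the repo file first and comparing the result with `diff` is a cheap way to confirm that only the baseurl line changed.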
Actions #76

Updated by okurz 7 months ago

  • Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Actions #77

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress

Upgraded to Leap 15.5 and installed the kernel backport as per #114565-85

Now jobs fail like https://openqa.opensuse.org/tests/3621045 with

[2023-10-04T18:08:03.405050Z] [debug] [pid:14059] QEMU: QEMU emulator version 7.1.0 (SUSE Linux Enterprise 15)
[2023-10-04T18:08:03.405123Z] [debug] [pid:14059] QEMU: Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
[2023-10-04T18:08:03.405173Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405222Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405270Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405321Z] [debug] [pid:14059] QEMU: KVM: Failed to create TCE64 table for liobn 0x80000000
[2023-10-04T18:08:03.405374Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405422Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405470Z] [debug] [pid:14059] QEMU: Unknown host!
[2023-10-04T18:08:03.405518Z] [debug] [pid:14059] QEMU: Unknown host!

I wonder if one of the other machines we have behaves the same on Leap 15.5. I found https://bugzilla.redhat.com/show_bug.cgi?id=1440619 about "KVM: Failed to create TCE64 table for liobn" which mentions a fix.

If I run manually

/usr/bin/qemu-system-ppc64 -device VGA,edid=on,xres=1024,yres=768 -g 1024x768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 4096 -machine usb=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:11 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot once=d -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 4 -enable-kvm -no-shutdown -vnc :77,share=force-shared

I get

KVM: Failed to create TCE64 table for liobn 0x71000002
KVM: Failed to create TCE64 table for liobn 0x80000000
error: kvm run failed Device or resource busy
This is probably because your SMT is enabled.
VCPU can only run on primary threads with all secondary threads offline.
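To spot this failure mode in captured QEMU output without reading the whole log, a check along these lines might help. It is a sketch only: the function name is made up, and it simply greps for the hint text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given file (captured QEMU output)
# contains the SMT hint shown above.
# Usage: qemu_failed_due_to_smt FILE
qemu_failed_due_to_smt() {
    grep -q 'VCPU can only run on primary threads' "$1"
}

# Example (the log file name is an assumption):
# qemu_failed_due_to_smt qemu-output.log && echo "SMT is probably still enabled"
```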

#119008-13 provided some hint:

systemctl enable --now smt_off
echo 'kvm_hv' > /etc/modules-load.d/kvm.conf
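Before triggering a job it may be worth verifying that both changes took effect. A minimal sketch, assuming `ppc64_cpu` from powerpc-utils is installed and prints "SMT is off" when SMT is disabled; the function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical pre-flight checks for a ppc64le KVM worker.

# Reads "ppc64_cpu --smt" output on stdin; succeeds only if SMT is reported
# as off (assumption: the tool prints exactly "SMT is off" in that case).
smt_is_off() {
    grep -qx 'SMT is off'
}

# Reads "lsmod" output on stdin; succeeds if the kvm_hv module is listed
# in the first column.
kvm_hv_loaded() {
    awk '$1 == "kvm_hv" { found = 1 } END { exit !found }'
}

# Example on the worker itself:
# ppc64_cpu --smt | smt_is_off && lsmod | kvm_hv_loaded && echo "ready for KVM jobs"
```

Note that the entry in /etc/modules-load.d/ only takes effect on the next boot; until then `modprobe kvm_hv` would be needed.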

triggered https://openqa.opensuse.org/tests/3621076#live

Actions #78

Updated by okurz 7 months ago

  • Due date set to 2023-10-18
  • Status changed from In Progress to Feedback

https://openqa.opensuse.org/tests/3621076 passed, looks good. Let this run for some time.

Actions #79

Updated by okurz 7 months ago

  • Status changed from Feedback to Workable

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4124 merged. Should check that the proper FQDNs resolve.

Actions #80

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress

I can't reach the machine over IPMI. Power cycling over PDU

Actions #81

Updated by okurz 7 months ago

From SoL

[   13.017096][ T1737] ses 0:0:8:0: Attached Enclosure device

Welcome to openSUSE Leap 15.5 - Kernel 5.14.21-150500.55.28-default (hvc0).

eth0:
eth1:
eth2:
eth3:
eth4:
eth5:
eth6:
eth7: 10.168.193.11 fe80::42f2:e9ff:fe5d:3d3f


kerosene-8 login: [  310.260509][  T101] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[  310.260565][  T101] CPU: 0 PID: 101 Comm: kworker/dying Kdump: loaded Tainted: G        W      X  N 5.14.21-150500.55.28-default #1 SLE15-SP5 55c4465f644d5be5ce7c285d13ad6540a5a1471c
[  310.260604][  T101] Call Trace:
[  310.260619][  T101] [c000000006c13ac0] [c00000000088c900] dump_stack_lvl+0x74/0xa8 (unreliable)
[  310.260646][  T101] [c000000006c13b00] [c00000000015adb4] panic+0x180/0x42c
[  310.260665][  T101] [c000000006c13b90] [c000000000e0ba24] __schedule+0x13e4/0x1440
[  310.260685][  T101] [c000000006c13ca0] [c0000000001b2b34] do_task_dead+0x64/0x70
[  310.260705][  T101] [c000000006c13cd0] [c0000000001647b4] do_exit+0x864/0xb60
[  310.260723][  T101] [c000000006c13da0] [c000000000194ba4] kthread+0x154/0x1d0
[  310.260741][  T101] [c000000006c13e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64

The system should not run this recent kernel version. On diesel+petrol we also run downgraded older versions.

I must have applied the downgrade incorrectly, so this is what I did now:

for i in 3 2 1; do zypper rl $i; done
zypper in --oldpackage http://download.opensuse.org/update/leap/15.3/sle/ppc64le/kernel-default-5.3.18-150300.59.93.1.ppc64le.rpm http://download.opensuse.org/update/leap/15.3/sle/ppc64le/util-linux-2.36.2-150300.4.23.1.ppc64le.rpm
zypper rm kernel-default-5.14.21
for i in kernel-default util-linux; do zypper al --comment "poo#119008, kernel regression boo#1202138" $i; done
reboot
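After the reboot, the downgrade can be confirmed with a check like the one below. It is a sketch under the assumption that the pinned kernel reports a release string starting with 5.3.18-; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical post-reboot check: succeed only if the given kernel release
# string (e.g. from "uname -r") belongs to the pinned 5.3.18 series.
kernel_is_pinned() {
    case "$1" in
        5.3.18-*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example on the worker after reboot:
# kernel_is_pinned "$(uname -r)" || echo "unexpected kernel: $(uname -r)"
# zypper ll   # lists package locks; kernel-default and util-linux should show up
```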
Actions #82

Updated by okurz 7 months ago

  • Status changed from In Progress to Feedback
Actions #83

Updated by okurz 7 months ago

  • Due date deleted (2023-10-18)
  • Status changed from Feedback to Resolved

monitoring MR merged, all good, resolved.

Actions #84

Updated by okurz 7 months ago

  • Related to action #138038: diesel+petrol missing network, IPMI still reachable added