action #139115

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M

Added by okurz 6 months ago. Updated 24 days ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Target version: Ready
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Most PowerPC machines are being set up in PRG2 within #132140 and most of them could be discovered from the HMC. qa-power8-3 is meant for o3 and likely needs more collaboration with SUSE-IT Eng-Infra to bring it back into operation for o3: as the machine is a bare-metal installation we rely on ASM+IPMI (no HMC needed) and the system ethernet in the o3 network.

Acceptance criteria

  • AC1: qa-power8-3 openQA instances are able to pass o3 openQA jobs after the move to PRG2

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now (Resolved, okurz, 2020-12-15 – 2021-04-16)

Copied from QA - action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 (Resolved, nicksinger, 2023-06-29)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #139112: Ensure OSD openQA PowerPC machine grenache is operational from PRG2 added
Actions #2

Updated by okurz 5 months ago

  • Target version changed from future to Ready

As decided when estimating #139199, there might be too many problems at once there, so we opt to work on this one (#139115) first.

Actions #3

Updated by okurz 5 months ago

  • Subject changed from Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 to Ensure o3 openQA PowerPC machine qa-power8-3 is operational from PRG2 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by mkittler 5 months ago

  • Assignee set to mkittler
Actions #5

Updated by mkittler 5 months ago

Check if one of the specified interfaces shows up in the o3 DHCP logs (dnsmasq)

None of the MAC addresses from https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=2352 show up in the dnsmasq.service journal on ariel.
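
A minimal sketch of that check, assuming the dnsmasq journal on ariel and using the partial MAC 6C:AE:8B:02:E3 quoted later in this ticket:

# on ariel, look for DHCP requests from the machine's MAC addresses
journalctl -u dnsmasq.service | grep -i '6c:ae:8b:02:e3'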

Actions #6

Updated by mkittler 5 months ago

It is not clear how to access the ASM so I've asked on Slack, see https://suse.slack.com/archives/C04MDKHQE20/p1701696302706239.

Actions #7

Updated by okurz 5 months ago

  • Status changed from Workable to Feedback
  • Assignee changed from mkittler to okurz
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next

https://suse.slack.com/archives/C04MDKHQE20/p1701704953616129?thread_ts=1701696302.706239&cid=C04MDKHQE20

(lots and lots of confusing discussions …)
(Jiri Novak) Anthony confirmed to me those are Power machines that were just carried over and powered on, as they have nothing like IPMI.
It had an IP in 192.168.112.x according to racktables. Is any other machine using the same VLAN? If yes, try to ping/ssh to it from there. If not, I guess adding that subnet somewhere would do.
(Oliver Kurz) The machine does have an FSP offering an HMC which is configured to offer IPMI over that ethernet interface. 192.168.112.x was the address in the system ethernet connection, not the HMC. and I have double-checked: The mac address part 6C:AE:8B:02:E3 does not show up in DHCP logs on o3 so that machine never requested an address on the system ethernet interface in the network where it should be. @Jiri Novak @Anthony Stalker if you are sure that the system is connected then please update https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=2352 as the entries for connection to qanet15nue.qa are certainly not up-to-date
(Anthony Stalker) The system is only racked. It is not connected to power or network cabled.
(Oliver Kurz) ok, so https://suse.slack.com/archives/C04MDKHQE20/p1701702995472689?thread_ts=1701696302.706239&cid=C04MDKHQE20 is wrong then? What's the plan for having those connected then?
(Anthony Stalker) That's a project management competence, I do not know and I cannot give you an answer.
(Oliver Kurz) @Moroni Flores I failed to find a Jira task I could follow for the setup of the machine "qa-power8-3" in rack PRG2-J12. I know only of https://jira.suse.com/browse/ENGINFRA-2009 which apparently did not include all the cabling and setup. Can you help by pointing me to the corresponding card that we should wait for?

Actions #8

Updated by okurz 4 months ago

No response. I asked jford about it and he promised to at least provide corresponding Jira tasks that we can follow. So far that also did not happen.

Actions #9

Updated by okurz 4 months ago

  • Status changed from Feedback to Blocked

As decided with mhaeffner, I have now created a specific Jira card myself: https://jira.suse.com/browse/ENGINFRA-3692

Actions #10

Updated by okurz 30 days ago

  • Status changed from Blocked to In Progress
  • Target version changed from Tools - Next to Ready

Progress in https://jira.suse.com/browse/ENGINFRA-3692.

gschlotter contacted me. The machine's IPMI should now be reachable over oqa-jumpy.dmz-prg2.suse.org. I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/762 to add the credentials.

Actions #11

Updated by okurz 30 days ago · Edited

But the IPMI connection seems to be unstable:

jumpy@oqa-jumpy:~> ipmitool -I lanplus -H qa-power8-3-ipmi … power status
> Error: no response from RAKP 1 message
Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@oqa-jumpy:~> ipmitool -I lanplus -H qa-power8-3-ipmi … power status
Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session
jumpy@oqa-jumpy:~> ipmitool -I lanplus -H qa-power8-3-ipmi … power status
> Error: no response from RAKP 1 message
Chassis Power is on
Close Session command failed
jumpy@oqa-jumpy:~> ipmitool -I lanplus -H qa-power8-3-ipmi … power status
Chassis Power is on
jumpy@oqa-jumpy:~> ipmitool -I lanplus -H qa-power8-3-ipmi … power status
Chassis Power is on
Close Session command failed

I guess we need to accept that. ASM is also reachable over ssh -L 8080:qa-power8-3-ipmi:80 -NT o3-jumpy.
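
Given this flakiness, a simple retry loop helps when scripting against this BMC; a minimal sketch, assuming the credentials are available as environment variables (variable names and retry count are placeholders, not taken from the salt pillars):

# retry because individual RAKP/session errors appear to be transient
# IPMI_USER/IPMI_PASS are placeholders for the actual credentials
for i in 1 2 3 4 5; do
    ipmitool -I lanplus -H qa-power8-3-ipmi -U "$IPMI_USER" -P "$IPMI_PASS" power status && break
    sleep 2
done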

sol activate shows me petitboot but no boot entries. In the petitboot shell I see the devices.

From dmesg:

[  228.000800] BTRFS: device fsid 1cd396af-c783-43b9-92d4-210b129a5cdd devid 1 transid 633394 /dev/sda3
[  228.026960] BTRFS info (device dm-2): disk space caching is enabled
[  228.026966] BTRFS: has skinny extents
[  228.054310] BTRFS: detected SSD devices, enabling SSD mode
[  228.597405] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
[  228.597506] BTRFS: bdev /dev/dm-1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[  228.597520] BTRFS: error (device dm-2) in btrfs_commit_transaction:1950: errno=-5 IO failure (Error while writing out transaction)
[  228.597528] BTRFS warning (device dm-2): Skipping commit of aborted transaction.
[  228.597532] BTRFS: Transaction aborted (error -5)
[  228.597552] ------------[ cut here ]------------
[  228.597555] WARNING: at fs/btrfs/super.c:260
[  228.597558] Modules linked in:
[  228.597567] CPU: 109 PID: 3188 Comm: pb-discover Not tainted 3.18.17-321.el7_1.11.ppc64le #1
[  228.597573] task: c0000007e9f61b20 ti: c0000007e8088000 task.ti: c0000007e8088000
[  228.597578] NIP: c0000000003094d8 LR: c0000000003094d4 CTR: c0000000003fc8d4
[  228.597582] REGS: c0000007e808b540 TRAP: 0700   Not tainted  (3.18.17-321.el7_1.11.ppc64le)
[  228.597585] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002822  XER: 20000000
[  228.597603] CFAR: c000000000b67dac SOFTE: 1 
[  228.597603] GPR00: c0000000003094d4 c0000007e808b7c0 c000000001182500 0000000000000025 
[  228.597603] GPR04: 0000000000000001 0000000000000000 c0000000010d2500 c0000000011d2500 
[  228.597603] GPR08: 0000000000000007 0000000000000001 0000000000000007 20646574726f6261 
[  228.597603] GPR12: 0000000000002200 c00000000fe94700 c0000007f5e84100 c0000007f5e84ee8 
[  228.597603] GPR16: c0000007f5e84a10 c0000007f5e849e8 0000000000010000 c0000007f5e845a8 
[  228.597603] GPR20: 0000000000000000 0000000000000000 000000000009aa32 c0000007e808b968 
[  228.597603] GPR24: c0000007f5e847b0 c0000007e54c000c fffffffffffffffb c000000000bbb860 
[  228.597603] GPR28: 0000000000000647 c0000007e5370000 fffffffffffffffb c0000007e6961000 
[  228.597662] NIP [c0000000003094d8] __btrfs_abort_transaction+0x68/0x128
[  228.597668] LR [c0000000003094d4] __btrfs_abort_transaction+0x64/0x128
[  228.597671] Call Trace:
[  228.597677] [c0000007e808b7c0] [c0000000003094d4] __btrfs_abort_transaction+0x64/0x128 (unreliable)
[  228.597686] [c0000007e808b850] [c000000000334074] cleanup_transaction+0x88/0x27c
[  228.597692] [c0000007e808b8f0] [c00000000033541c] btrfs_commit_transaction+0xa4c/0xa58
[  228.597699] [c0000007e808b9f0] [c00000000032f63c] btrfs_commit_super+0xa0/0xac
[  228.597705] [c0000007e808ba20] [c000000000332bb8] open_ctree+0x1984/0x1d88
[  228.597711] [c0000007e808bb60] [c00000000030aae0] btrfs_mount+0x510/0x864
[  228.597718] [c0000007e808bc70] [c00000000013fbb0] mount_fs+0x2c/0xc4
[  228.597725] [c0000007e808bcf0] [c00000000015ae70] vfs_kern_mount+0x64/0x140
[  228.597731] [c0000007e808bd40] [c00000000015e6ac] do_mount+0x9ac/0xb18
[  228.597737] [c0000007e808bdd0] [c00000000015ea74] SyS_mount+0x90/0xc8
[  228.597744] [c0000007e808be30] [c000000000009198] syscall_exit+0x0/0x98
[  228.597747] Instruction dump:
[  228.597750] 7d0050a8 7d074b78 7ce051ad 40c2fff4 7c0004ac 7909f7e3 40e2001c 3c62ffd8 
[  228.597761] 7fc4f378 3863ffc7 4885e88d 60000000 <0fe00000> e93d0028 b3dd0050 2fa90000 
[  228.597773] ---[ end trace 76e3c7c43c9dd2c2 ]---
[  228.597778] BTRFS: error (device dm-2) in cleanup_transaction:1607: errno=-5 IO failure
[  228.597785] BTRFS info (device dm-2): delayed_refs has NO entry
[  228.723072] BTRFS: open_ctree failed
[  229.030163] device-mapper: snapshots: Snapshot is marked invalid.
[  229.031832] EXT4-fs (dm-2): unable to read superblock

But ip a shows that we have received an IP address on both eth0+eth1, and I can confirm reachability from the o3 side as well, so all good regarding the network setup.
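
A minimal sketch of that two-way check (the hostname used from the o3 side is an assumption; use whatever name or address dnsmasq handed out):

# on qa-power8-3: confirm addresses on the system interfaces
ip -4 addr show eth0 eth1
# on ariel (o3): confirm the machine is reachable in the other direction
ping -c 3 qa-power8-3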

I set https://jira.suse.com/browse/ENGINFRA-3692 to done.

Actions #12

Updated by okurz 29 days ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added
Actions #13

Updated by okurz 29 days ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
  • Priority changed from Low to High
Actions #14

Updated by nicksinger 24 days ago

  • Assignee set to nicksinger
Actions #15

Updated by nicksinger 24 days ago

  • Status changed from Workable to In Progress

I was able to mount /dev/sda3, which holds the root fs of the system, from petitboot and chrooted into it following https://wiki.gentoo.org/wiki/Chroot#Configuration. Now trying to understand why petitboot fails to find the partition, how to fix it and whether we face some hardware issue.
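
A minimal sketch of those steps from the petitboot shell, roughly following the linked Gentoo wiki page (mount point and bind mounts are the usual ones, not copied from the actual session):

mount /dev/sda3 /mnt                  # root fs of the installed system
mount --types proc /proc /mnt/proc    # pseudo filesystems needed inside the chroot
mount --rbind /sys /mnt/sys
mount --rbind /dev /mnt/dev
chroot /mnt /bin/bash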

Actions #16

Updated by nicksinger 24 days ago

Did a zypper ref && zypper dup from within the chroot. This installed a new kernel, which was sufficient to make the partition visible in petitboot. The system was able to boot again and we can now even ssh into it via ariel (so, as confirmed previously, the network is perfectly fine for this machine).

After booting I checked the filesystem on /dev/sda with btrfs scrub status /, which reports "no errors found". Checking /boot I can only see a single kernel image (vmlinux-5.3.18-150300.59.93-default), and given that petitboot was not able to boot, I assume we had no kernel at all before I chrooted into the machine.
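
A minimal sketch of that verification (a scrub needs to be started before its status can be read; the kernel file name is the one quoted above):

btrfs scrub start -B /       # -B waits for the scrub to finish
btrfs scrub status /         # expect "no errors found"
ls -l /boot/vmlinux-*        # currently only one kernel image present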

I remember we had something like this in the past and asked the team if anyone remembers: https://suse.slack.com/archives/C02AJ1E568M/p1712141650548919
I will also cross-check our OSD Power machines to see if we might have to adjust our package locks, which currently are:

qa-power8-3:~ # zypper ll

# | Name           | Type    | Repository | Comment
--+----------------+---------+------------+------------------------------------------
1 | kernel-default | package | (any)      | poo#119008, kernel regression boo#1202138
2 | util-linux     | package | (any)      | poo#119008, kernel regression boo#1202138
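
If the locks need adjusting so that kernel updates keep flowing, a minimal sketch using zypper's lock handling (whether to drop or re-add the locks is still to be decided; lock comments live in /etc/zypp/locks):

zypper rl kernel-default util-linux    # remove the existing locks
zypper al kernel-default util-linux    # or re-add them afterwards
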
Actions #17

Updated by nicksinger 24 days ago

The worker already picked up o3 jobs again and was successful: https://openqa.opensuse.org/tests/4058770
https://openqa.opensuse.org/admin/workers showed a lot of "localhost:{1..8}" workers with the qa-power8-3 worker class. @okurz restarted all workers and now they show up properly again. Now rebooting the machine to check if we can reproduce the issue.
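
A minimal sketch of such a restart on the machine itself, assuming the usual openqa-worker@ unit naming and the eight slots visible in the admin table above:

systemctl restart openqa-worker@{1..8}   # restart all worker slots so they re-register under their real hostname
systemctl status openqa-worker@1         # spot-check one slot afterwards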

Actions #18

Updated by nicksinger 24 days ago

  • Status changed from In Progress to Resolved

/etc/hostname was empty after a reboot. Used hostnamectl hostname --static qa-power8-3 to set it, which also wrote the content of /etc/hostname. After another reboot, the system came up with a proper hostname, also in openQA. Most recent job which proves the machine does what we expect: https://openqa.opensuse.org/tests/4058726
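
A quick way to verify the fix persists, as a minimal sketch (on older systemd versions the equivalent setter would be hostnamectl set-hostname):

cat /etc/hostname    # should now contain qa-power8-3
hostnamectl status   # static hostname should survive the reboot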
