action #131528 (closed)

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines

Bring backup.qam.suse.de up-to-date size:M

Added by okurz 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2023-06-28
Due date:
% Done: 0%
Estimated time:
Description

Motivation

https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 says "BROKEN" but we don't have more detail.

Acceptance criteria

  • AC1: The machine entry in netbox https://netbox.suse.de/dcim/devices/6343/ is up-to-date and has either a "Move to …" or a "To be decommissioned" tag
  • AC2: We know if/how the machine is usable and how it is used

Suggestions

  • Tag with "Move to Frankencampus" for now and ask former owners, admins, users about the machine
  • When we happen to go to NUE1, physically check the machine

Related issues 2 (0 open, 2 closed)

Related to QA - action #125234: Decommission obsolete machines in qam.suse.de size:M (Resolved, okurz, 2023-03-01)

Copied to openQA Infrastructure - action #134453: backup.qam.suse.de is Failed according to netbox and not creating backups size:M (Resolved, mkittler)
Actions #1

Updated by okurz 10 months ago

  • Status changed from New to Feedback

Updated the netbox entry with "Move to Frankencampus" and "QE LSG" for now.

Asked in https://suse.slack.com/archives/C02CANHLANP/p1687950845184129

@Heiko Rommel @Antonios Pappas @Martin Pluskal https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 says backup.qam.suse.de is "BROKEN". Do you have more details about that? I just created https://progress.opensuse.org/issues/131528 about this

Actions #2

Updated by okurz 10 months ago

  • Related to action #125234: Decommission obsolete machines in qam.suse.de size:M added
Actions #3

Updated by okurz 10 months ago

  • Tags changed from infra, backup.qam.suse.de, machine, nue1, dct migration to infra, backup.qam.suse.de, machine, nue1, dct migration, next-maxtorhof-visit
  • Subject changed from Bring backup.qam.suse.de up-to-date to Bring backup.qam.suse.de up-to-date size:M
  • Description updated (diff)
Actions #4

Updated by okurz 10 months ago

Response by hrommel:

I remember two issues: the Java app on the IPMI based web console is broken/unusable (http://backup-mgmt.qam.suse.de/), and the software RAID has a failed drive and is currently not redundant (see mdadm --detail /dev/md126).

I can log in to http://backup-mgmt.qam.suse.de/, and for the Java app the usual quirks apply as documented in https://progress.opensuse.org/projects/openqav3/wiki/#Accessing-old-BMCs-with-Java-iKVM-Viewer-when-ipmitool-does-not-work-eg-imagetester

Asked in https://suse.slack.com/archives/C02CANHLANP/p1688540084286679?thread_ts=1687950845.184129&cid=C02CANHLANP

Can you clarify how backup.qam.suse.de is being used? What should our plans be regarding it? We can look for a storage device replacement if we intend to continue running this machine.

Actions #5

Updated by okurz 10 months ago

  • Status changed from Feedback to In Progress

Found new BMC firmware REDFISH_X10_390_20200717_unsigned.zip on https://www.supermicro.com/en/support/resources/downloadcenter/firmware/MBD-X10SLH-F/BMC that could help; upgrading from 1.90 to 3.90.

For the hardware replacement: cat /proc/mdstat shows that sdc has failed. hdparm -I /dev/sdc reveals

        Model Number:       ST2000DM001-1ER164                      
        Serial Number:      S4Z0K1YM

smartctl --all /dev/sdc says that the SMART overall health is PASSED. I triggered smartctl --test=long /dev/sdc, expected to complete "after Wed Jul 5 13:10:05 2023". Will look again after that time.
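For reference, the progress and outcome of such a self-test can be watched with smartmontools (a minimal sketch, not from the ticket):

smartctl -l selftest /dev/sdc   # self-test log incl. completion status or "in progress"
smartctl -A /dev/sdc | grep -Ei 'realloc|pending|uncorrect'   # sector counters that typically precede drive failure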

Also I found in dmesg that the XFS on /backup reports corrupted metadata, so I unmounted /backup and am now running xfs_repair /dev/md126.
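For reference, the usual sequence here is roughly the following (a sketch; xfs_repair refuses to run on a mounted filesystem, and -L, which zeroes a dirty log, is a last resort that can lose recently logged metadata):

umount /backup
xfs_repair -n /dev/md126   # dry run, only report problems
xfs_repair /dev/md126      # actual repair; if it complains about a dirty log,
                           # mount and unmount once to replay it before retrying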

On the host I called find -mtime -30 in /backup and found no files changed within the last 30 days, so I don't think any backup is still being collected. find -mtime -300 does not return anything either.

Trying to upgrade the OS with zypper migration. First, patches were missing; then the subsequent check failed with

Can't determine the list of installed products: Errno::ETIMEDOUT: Connection timed out - connect(2) for "smt-scc.nue.suse.com" port 443

I called SUSEConnect --cleanup && SUSEConnect --regcode … --url https://scc.suse.com using a regcode for "SUSE Employee subscription for SUSE Linux Enterprise Server x86_64" from SCC Beta Registration Codes A176543 on scc.suse.com. Now running zypper dup before trying again with the actual migration.
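The re-registration boils down to roughly this sequence (a sketch with a placeholder regcode):

SUSEConnect --cleanup                                   # drop the stale SMT registration data
SUSEConnect --regcode <REGCODE> --url https://scc.suse.com
zypper ref && zypper dup                                # apply pending updates first
zypper migration                                        # then retry the actual migration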

zypper migration then claimed there would be no available migrations, so I called SUSEConnect -p SLES-LTSS/12.4/x86_64 -r …. It said something about other extensions not being enabled, so I went crazy and did

SUSEConnect -p sle-module-toolchain/12/x86_64
SUSEConnect -p sle-sdk/12.4/x86_64
SUSEConnect -p sle-live-patching/12.4/x86_64 --regcode …

with the regcode for the latter from "SUSE Linux Enterprise Live Patching test subscription". Despite that command saying it had registered the system successfully, both SUSEConnect --status-text and zypper migration claim that the live patching product is not registered, so this won't work. We could try an offline migration or just side-grade to Leap. Remote management over IPMI and the web interface works fine, so let's just go ahead with that. I am mostly following what I did in #117679 and #125534-6. First I deregistered the system and ensured I can access it over IPMI, though I did not see any response on ipmitool … sol activate.
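For reference, a serial-over-LAN session against the BMC would look roughly like this (hostname from this ticket; user and password are placeholders):

ipmitool -I lanplus -H backup-mgmt.qam.suse.de -U <user> -P <password> sol activate
# if a stale session blocks activation:
ipmitool -I lanplus -H backup-mgmt.qam.suse.de -U <user> -P <password> sol deactivate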

I copied over a clean set of repo files out of a Leap 15.5 container with

mkdir -p ~/local/tmp/leap_15.5_repos && podman run --pull=newer --rm -it -v /home/okurz/local/tmp/leap_15.5_repos:/tmp/repos registry.opensuse.org/opensuse/leap:15.5 sh -c 'cp -a /etc/zypp/repos.d/* /tmp/repos/'

and then from my computer to backup.qam.suse.de with

rsync -aHP ~/local/tmp/leap_15.5_repos/ root@backup.qam.suse.de:/etc/zypp/repos.d.leap_15.5/

and on backup.qam.suse.de:

cd /etc/zypp/
mv repos.d repos.d.sle12sp4_old
mv repos.d.leap_15.5/ repos.d
export releasever=15.5
zypper --gpg-auto-import-keys --releasever=$releasever -n ref && zypper --releasever=$releasever -n dup --details --allow-vendor-change --allow-downgrade --replacefiles --auto-agree-with-licenses --download-in-advance && zypper --releasever=$releasever -n in --force openSUSE-release
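For reference, a quick sanity check that the side-grade took effect could look like this (a sketch, not from the ticket):

cat /etc/os-release      # should now report openSUSE Leap 15.5
zypper lr -u             # all repos should point at leap/15.5
rpm -q openSUSE-release  # the release package forced in above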

After two retries the dup went through fine. Then I rebooted and the machine came up again, but there was no grub or kernel output on the serial port.
I changed some settings, e.g. in /etc/default/grub from GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --parity=no" to GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200". Some systemd services failed, e.g. ypbind; I took over the config from login.suse.de. Also libvirt related services failed. I assume those services are currently not needed, so I disabled them with systemctl disable --now libvirtd.service libvirtd-admin.socket libvirtd-ro.socket libvirtd.socket.

Regarding the RAID it seems all devices are ok; maybe a device temporarily lost its connection. for i in /dev/sd? ; do smartctl --all $i | grep 'SMART overall-health'; done looks fine:

SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED

So I just did mdadm --add /dev/md/backup /dev/sdd and now the array is rebuilding. It stated it would take about 180 minutes; will look back later.
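The rebuild progress can be watched with (a sketch):

watch -n 60 cat /proc/mdstat                                # progress bar and ETA
mdadm --detail /dev/md/backup | grep -iE 'state|rebuild'    # array state and rebuild percentage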

Actions #6

Updated by okurz 10 months ago

  • Status changed from In Progress to Blocked

Fixed the serial output of the grub menu, the Linux kernel and the tty getty with entries in /etc/default/grub pointing to the 3rd serial port:

GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS2,115200n8 noresume nospec crashkernel=222M-:111M showopts vga=0"
…
GRUB_SERIAL_COMMAND="serial --unit=2 --speed=115200"
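For these entries to take effect the grub config needs to be regenerated and a getty must listen on that port; on openSUSE that is roughly (a sketch; systemd usually spawns the serial getty automatically for the console= device, so the second command may be redundant):

grub2-mkconfig -o /boot/grub2/grub.cfg
systemctl enable --now serial-getty@ttyS2.service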

After another reboot the service rc-local.service failed because /dev/sdf vanished. Where did it go? It is still missing after another reboot. As noted in netbox, I think it's still a good idea to move the machine to FC Basement so that we can check it physically.

Waiting for #100455 to conclude, which might shed some light on when we would have the machine in FC Basement.

Actions #7

Updated by okurz 9 months ago

  • Description updated (diff)
  • Status changed from Blocked to Resolved

The machine was moved to FC Basement as part of #132614 and is up and running. https://netbox.suse.de/dcim/devices/6343/ is currently not in sync with racktables but https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 is up-to-date. I also added IP addresses in racktables.

I followed the salt-minion workaround from
https://progress.opensuse.org/projects/openqav3/wiki/#Network-legacy-boot-via-PXE-and-OSworker-setup and then https://gitlab.suse.de/openqa/salt-states-openqa#openqa-salt-states and applied a highstate on the machine.

In /var/log/salt/minion I found

2023-08-04 21:38:13,990 [salt.state       :319 ][ERROR   ][22865] Unable to trigger watch for service.enabled
2023-08-04 21:39:40,148 [salt.minion      :1988][WARNING ][27863] Passed invalid arguments to state.highstate: expected str, bytes or os.PathLike object, not list

    Retrieve the state data from the salt master for this minion and execute it

and one state failed to apply (225 succeeded), but I could not identify the failed one. Running sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1' again multiple times yields stable success, and all looks good on the machine now.
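To pin down a single failed state, filtering the highstate output for failures is usually enough (a sketch):

sudo salt --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True 2>&1 | grep -B 6 'Result: False'
# or suppress all successful states:
sudo salt --state-verbose=false 'backup-qam.qe.nue2.suse.org' state.apply queue=True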

Actions #8

Updated by livdywan 9 months ago

  • Copied to action #134453: backup.qam.suse.de is Failed according to netbox and not creating backups size:M added