coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines

Bring up-to-date size:M

Added by okurz 12 months ago. Updated 11 months ago.

Motivation says "BROKEN" but we don't have more detail.

Acceptance criteria

  • AC1: The machine entry in netbox is up-to-date and has either "Move to …" or "To be decommissioned" tag
  • AC2: We know if/how the machine is usable and how it is used


  • Tag with "Move to Frankencampus" for now and ask former owners, admins, users about the machine
  • When we happen to go to NUE1, physically check the machine

Related to QA - action #125234: Decommission obsolete machines in size:M (Resolved, okurz, 2023-03-01)

Copied to openQA Infrastructure - action #134453: is Failed according to netbox and not creating backups size:M (Resolved, mkittler)

Updated by okurz 12 months ago

  • Status changed from New to Feedback

Updated the netbox entry with "Move to Frankencampus" and "QE LSG" for now.

Asked in

@Heiko Rommel @Antonios Pappas @Martin Pluskal says is "BROKEN". Do you have more details about that? I just created about this

Updated by okurz 12 months ago

  • Related to action #125234: Decommission obsolete machines in size:M added

Updated by okurz 12 months ago

  • Tags changed from infra,, machine, nue1, dct migration to infra,, machine, nue1, dct migration, next-maxtorhof-visit
  • Subject changed from Bring up-to-date to Bring up-to-date size:M
  • Description updated (diff)

Updated by okurz 12 months ago

Response by hrommel:

I remember two issues: the Java app on the IPMI-based web console is broken/unusable, and the software RAID has a failed drive and is currently not redundant (see mdadm --detail /dev/md126)

I can login to and for the Java app the usual quirks apply as documented in

Asked in

Can you clarify how is being used? What should be our plans regarding it? We can look for a storage device replacement if we intend to continue running this machine.

Updated by okurz 12 months ago

  • Status changed from Feedback to In Progress

Found new BMC firmware on that could help, upgrading from 1.90 to 3.90.

For the hardware replacement: cat /proc/mdstat shows that sdc has failed. hdparm -I /dev/sdc reveals

        Model Number:       ST2000DM001-1ER164                      
        Serial Number:      S4Z0K1YM

smartctl --all /dev/sdc says that the SMART overall health is PASSED. I triggered smartctl --test=long /dev/sdc. Expected to complete "after Wed Jul 5 13:10:05 2023". Will look again after that time.
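A rough sketch of how the failed member could be read out of /proc/mdstat instead of spotting it by eye. It relies on the kernel marking faulty members with an "(F)" suffix. The function name failed_md_members is made up for illustration and nothing below was run on the machine in this ticket.

```shell
#!/bin/sh
# failed_md_members: read mdstat-formatted text on stdin and print
# "<array> <device>" for every member that carries the faulty flag "(F)".
# Hypothetical helper, not part of mdadm.
failed_md_members() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip "[2](F)" and keep "sdc"
                print $1, dev
            }
    }'
}

# Intended use on a live system:
#   failed_md_members < /proc/mdstat
```

An empty result only means that no member is flagged as faulty. A member that was removed from the array entirely does not show up this way, so mdadm --detail remains the more complete check.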

Also I found in dmesg that the XFS on /backup is reported as having corrupted metadata, so I unmounted /backup and am running xfs_repair /dev/md126

On the host in /backup I called find -mtime -30 and found no files changed within the last 30 days. I don't think any backup is still being collected. Also find -mtime -300 does not return anything.
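The freshness check above could be wrapped into a small helper so it can be repeated with different time windows. This is only a sketch: count_recent_files is a hypothetical name, and unlike the plain find call it counts regular files only, so directories whose mtime changed are ignored.

```shell
#!/bin/sh
# count_recent_files DIR DAYS
# Print the number of regular files below DIR that were modified less than
# DAYS days ago. File names containing newlines would be miscounted.
count_recent_files() {
    find "$1" -type f -mtime "-$2" | wc -l | tr -d ' '
}

# Intended use on the backup host:
#   count_recent_files /backup 30
#   count_recent_files /backup 300
```

A result of 0 for both windows would match the observation that no backup is being collected anymore.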

Trying to upgrade the OS with zypper migration. At first patches were missing. Then the subsequent check failed with

Can't determine the list of installed products: Errno::ETIMEDOUT: Connection timed out - connect(2) for "" port 443

I called SUSEConnect --cleanup && SUSEConnect --regcode … --url using a regcode "SUSE Employee subscription for SUSE Linux Enterprise Server x86_64" from SCC Beta Registration Codes A176543 on

Now running zypper dup before trying again with the actual migration.

zypper migration then said there would be no available migrations. So I called SUSEConnect -p SLES-LTSS/12.4/x86_64 -r …. It said something about other extensions not being enabled, so I went crazy and did

SUSEConnect -p sle-module-toolchain/12/x86_64
SUSEConnect -p sle-sdk/12.4/x86_64
SUSEConnect -p sle-live-patching/12.4/x86_64 --regcode …

with the regcode for the latter from "SUSE Linux Enterprise Live Patching test subscription". Despite that command saying that it registered the system successfully, both SUSEConnect --status-text and zypper migration claim that the live patching product is not registered, so this won't work. We could try an offline migration or just side-grade to Leap. Remote management with IPMI and over the web interface is working fine so let's just go ahead with that. I am mostly following what I did in #117679 and #125534-6. First I deregistered the system and ensured I can access over IPMI, though I did not see any response on ipmitool … sol activate.

I copied over a clean set of repo files out of a Leap 15.5 container with

mkdir -p ~/local/tmp/leap_15.5_repos && podman run --pull=newer --rm -it -v /home/okurz/local/tmp/leap_15.5_repos:/tmp/repos 'cp -a /etc/zypp/repos.d/* /tmp/repos/'

and then from my computer to with

rsync -aHP ~/local/tmp/leap_15.5_repos/

and on

cd /etc/zypp/
mv repos.d repos.d.sle12sp4_old
mv repos.d.leap_15.5/ repos.d
export releasever=15.5
zypper --gpg-auto-import-keys --releasever=$releasever -n ref && zypper --releasever=$releasever -n dup --details --allow-vendor-change --allow-downgrade --replacefiles --auto-agree-with-licenses --download-in-advance && zypper --releasever=$releasever -n in --force openSUSE-release

after two retries that went fine. Then I rebooted and the machine came up again but there was no grub or kernel output on serial port.
I changed some settings, e.g. in /etc/default/grub from GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --parity=no" to GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200".

Some systemd services failed, e.g. ypbind. I took over the config from

Also libvirt-related services failed. I assume those services are currently not needed so I disabled them with systemctl disable --now libvirtd.service libvirtd-admin.socket libvirtd-ro.socket libvirtd.socket.

Regarding the RAID, it seems all devices are ok but maybe a device temporarily lost its connection. for i in /dev/sd? ; do smartctl --all $i | grep 'SMART overall-health'; done looks fine:

SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
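The loop above prints six results but drops the device names, so a non-PASSED line could not be attributed to a disk. A possible variant, sketched here with a hypothetical helper smart_health_of, keeps the name next to the verdict. It assumes the ATA-style "SMART overall-health self-assessment test result" line shown above; SAS drives report their health in a different line and would come out as "unknown".

```shell
#!/bin/sh
# smart_health_of: read `smartctl --all` output on stdin and print the
# overall health verdict (e.g. PASSED), or "unknown" if the line is absent.
smart_health_of() {
    awk -F': *' '
        /SMART overall-health self-assessment test result/ { print $2; found = 1 }
        END { if (!found) print "unknown" }'
}

# Intended use on the host (needs root and smartmontools):
#   for dev in /dev/sd?; do
#       printf '%s %s\n' "$dev" "$(smartctl --all "$dev" | smart_health_of)"
#   done
```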

So I just did mdadm --add /dev/md/backup /dev/sdd. Now the array is rebuilding. Stated to take 180m. Looking back later.
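While the array rebuilds, the progress and the estimated time to finish can be pulled from the recovery line in /proc/mdstat. The following is a minimal sketch; md_rebuild_progress is a made-up name and the parsing assumes the usual "recovery = N% (...) finish=Nmin" layout printed by the kernel.

```shell
#!/bin/sh
# md_rebuild_progress: read mdstat-formatted text on stdin and print
# "<percent> <eta>" for each array that is currently recovering or resyncing.
md_rebuild_progress() {
    awk '/(recovery|resync) *=/ {
        pct = ""; eta = ""
        for (i = 1; i <= NF; i++) {
            # the kernel prints "=100.0%" without a space at three digits
            if ($i ~ /%$/)       { pct = $i; sub(/^=/, "", pct) }
            if ($i ~ /^finish=/) { eta = $i; sub(/^finish=/, "", eta) }
        }
        print pct, eta
    }'
}

# Intended use on the host:
#   md_rebuild_progress < /proc/mdstat
```

No output means that no recovery or resync is in progress, which after a rebuild should coincide with all members showing as up in the array status.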

Updated by okurz 12 months ago

  • Status changed from In Progress to Blocked

Fixed the serial output of the GRUB menu, the Linux kernel and the tty getty with entries in /etc/default/grub that point to the third serial port:

GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS2,115200n8 noresume nospec crashkernel=222M-:111M showopts vga=0"
GRUB_SERIAL_COMMAND="serial --unit=2 --speed=115200"
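Since the earlier attempt used a different unit, a small consistency check between the two settings may help avoid another reboot without serial output. This sketch assumes that GRUB serial unit N corresponds to the kernel's ttySN, which is the common case but not guaranteed on every board. grub_serial_check is a hypothetical helper name.

```shell
#!/bin/sh
# grub_serial_check FILE
# Compare the --unit of GRUB_SERIAL_COMMAND with the ttyS number of the last
# console=ttySN entry in GRUB_CMDLINE_LINUX_DEFAULT. Prints a verdict and
# returns non-zero on a mismatch or if either value is missing.
grub_serial_check() {
    unit=$(sed -n 's/^GRUB_SERIAL_COMMAND=.*--unit=\([0-9][0-9]*\).*/\1/p' "$1")
    tty=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*console=ttyS\([0-9][0-9]*\).*/\1/p' "$1")
    if [ -n "$unit" ] && [ "$unit" = "$tty" ]; then
        echo "ok: unit $unit"
    else
        echo "mismatch: grub=$unit kernel=$tty"
        return 1
    fi
}

# Intended use:
#   grub_serial_check /etc/default/grub
```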

After another reboot the service rc-local.service failed because /dev/sdf vanished. Where did it go? It is still missing after another reboot. I think it's still a good idea, as noted in netbox, that the machine should move to FC Basement so that we can check it physically.

Waiting for #100455 to conclude which might shed some light on when we would have the machine in FC Basement.

Updated by okurz 11 months ago

  • Description updated (diff)
  • Status changed from Blocked to Resolved

The machine was moved to FC Basement as part of #132614 and is up and running. is currently not in sync with racktables but is up-to-date. I also added IP addresses in racktables.

I followed the salt-minion workaround from and then and applied a high state on the machine.

In /var/log/salt/minion I found

2023-08-04 21:38:13,990 [salt.state       :319 ][ERROR   ][22865] Unable to trigger watch for service.enabled
2023-08-04 21:39:40,148 [salt.minion      :1988][WARNING ][27863] Passed invalid arguments to state.highstate: expected str, bytes or os.PathLike object, not list

    Retrieve the state data from the salt master for this minion and execute it

and one state failed to apply (225 succeeded) but I could not identify the failed one. Running again with sudo salt --no-color --state-output=changes '' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1' multiple times yields a stable success and all looks good on the machine now.
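For the case where a state fails and is hard to find in the output, one option is to filter the highstate output for blocks with "Result: False". This is a sketch under the assumption that failed states are printed in Salt's default multi-line layout ("ID:", "Function:", "Result:" lines), which is expected to hold with --state-output=changes as well. failed_salt_states is a hypothetical name.

```shell
#!/bin/sh
# failed_salt_states: read salt highstate output on stdin and print the ID of
# every state block whose Result is False.
failed_salt_states() {
    awk '
        $1 == "ID:"                        { id = $0; sub(/^ *ID: */, "", id) }
        $1 == "Result:" && $2 == "False"   { print id }'
}

# Intended use on the salt master, with the minion target left to fill in:
#   sudo salt --no-color --state-output=changes '<minion>' state.apply queue=True \
#       | failed_salt_states
```

An empty result together with a non-zero failed count in the summary would suggest that the failure is reported in a different format, in which case the minion log is the better place to look.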

Updated by livdywan 10 months ago

  • Copied to action #134453: is Failed according to netbox and not creating backups size:M added

