action #131528
closed coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #131525: [epic] Up-to-date and usable LSG QE NUE1 machines
Bring backup.qam.suse.de up-to-date size:M
Description
Motivation
https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 says "BROKEN" but we don't have more detail.
Acceptance criteria
- AC1: The machine entry in netbox https://netbox.suse.de/dcim/devices/6343/ is up-to-date and has either "Move to …" or "To be decommissioned" tag
- AC2: We know if/how the machine is usable and how it is used
Suggestions
- Tag with "Move to Frankencampus" for now and ask former owners, admins, users about the machine
- When we happen to go to NUE1, physically check the machine
Updated by okurz over 1 year ago
- Status changed from New to Feedback
Updated the netbox entry with "Move to Frankencampus" and "QE LSG" for now.
Asked in https://suse.slack.com/archives/C02CANHLANP/p1687950845184129
@Heiko Rommel @Antonios Pappas @Martin Pluskal https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 says backup.qam.suse.de is "BROKEN". Do you have more details about that? I just created https://progress.opensuse.org/issues/131528 about this
Updated by okurz over 1 year ago
- Related to action #125234: Decommission obsolete machines in qam.suse.de size:M added
Updated by okurz over 1 year ago
- Tags changed from infra, backup.qam.suse.de, machine, nue1, dct migration to infra, backup.qam.suse.de, machine, nue1, dct migration, next-maxtorhof-visit
- Subject changed from Bring backup.qam.suse.de up-to-date to Bring backup.qam.suse.de up-to-date size:M
- Description updated (diff)
Updated by okurz over 1 year ago
Response by hrommel:
I remember two issues: the Java app on the IPMI-based web console is broken/unusable (http://backup-mgmt.qam.suse.de/), and the software RAID has a failed drive and is currently not redundant (see mdadm --detail /dev/md126).
I can login to http://backup-mgmt.qam.suse.de/ and for the Java app the usual quirks apply as documented in https://progress.opensuse.org/projects/openqav3/wiki/#Accessing-old-BMCs-with-Java-iKVM-Viewer-when-ipmitool-does-not-work-eg-imagetester
Can you clarify how backup.qam.suse.de is being used? What should be our plans regarding it? We can look for a storage device replacement if we intend to continue running this machine.
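The redundancy state can also be confirmed without the web console by looking at the [U…] status field in /proc/mdstat. A minimal sketch, assuming typical mdstat output (the sample line below is illustrative, not the actual output of this machine):

```shell
# Sketch: detect a degraded md array. An underscore in the [UU_U]
# status field of /proc/mdstat marks a failed/missing member.
# Sample line for illustration only:
mdstat='md126 : active raid6 sda1[0] sdb1[1] sdd1[3] [4/3] [UU_U]'
if echo "$mdstat" | grep -q '\[U*_'; then
    echo "array degraded"
else
    echo "array healthy"
fi
```

In practice one would replace the sample variable with cat /proc/mdstat.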
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
Found new BMC firmware REDFISH_X10_390_20200717_unsigned.zip on https://www.supermicro.com/en/support/resources/downloadcenter/firmware/MBD-X10SLH-F/BMC that could help; upgrading from 1.90 to 3.90.
For the hardware replacement: cat /proc/mdstat shows that sdc has failed. hdparm -I /dev/sdc reveals
Model Number: ST2000DM001-1ER164
Serial Number: S4Z0K1YM
smartctl --all /dev/sdc says that the SMART overall health is PASSED. I triggered smartctl --test=long /dev/sdc, expected to complete "after Wed Jul 5 13:10:05 2023". I will look again after that time.
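Once the long test is due, smartctl -l selftest /dev/sdc shows the outcome in the drive's self-test log. A sketch of checking it, assuming the usual smartctl log format (the log line below is a made-up sample, not output from this drive):

```shell
# Sketch: check the most recent SMART self-test result. The variable
# holds a sample line in the format `smartctl -l selftest` prints.
selftest='# 1  Extended offline    Completed without error       00%     34565         -'
if echo "$selftest" | grep -q 'Completed without error'; then
    echo "long test passed"
else
    echo "long test failed or still running"
fi
```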
Also I found in dmesg that the XFS on /backup is reported as having corrupted metadata, so I unmounted /backup and am running xfs_repair /dev/md126.
On the host in /backup I called find -mtime -30 and found no files changed within the last 30 days. I don't think any backup is still being collected. Even find -mtime -300 does not return anything.
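The find -mtime staleness check can be reproduced on any host; a self-contained sketch using throw-away files (temporary paths, not /backup; touch -d is the GNU coreutils syntax):

```shell
# Sketch: count files modified within the last 30 days, as done on /backup.
# Uses a temporary directory with one fresh and one old file.
dir=$(mktemp -d)
touch "$dir/fresh"                    # mtime = now, inside the window
touch -d '40 days ago' "$dir/stale"   # mtime outside the 30-day window
recent=$(find "$dir" -type f -mtime -30 | wc -l)
echo "files modified in the last 30 days: $recent"
rm -rf "$dir"
```

A count of 0 on a backup tree, as seen here, means no backup job has written anything within the window.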
Trying to upgrade the OS with zypper migration. First, patches were missing. Then the subsequent check failed with
Can't determine the list of installed products: Errno::ETIMEDOUT: Connection timed out - connect(2) for "smt-scc.nue.suse.com" port 443
I called SUSEConnect --cleanup && SUSEConnect --regcode … --url https://scc.suse.com using a regcode "SUSE Employee subscription for SUSE Linux Enterprise Server x86_64" from SCC Beta Registration Codes A176543 on scc.suse.com. Now running zypper dup before trying again with the actual migration.
That said, zypper reported no available migrations, so I called SUSEConnect -p SLES-LTSS/12.4/x86_64 -r …. It said something about other extensions not being enabled, so I went ahead and did
SUSEConnect -p sle-module-toolchain/12/x86_64
SUSEConnect -p sle-sdk/12.4/x86_64
SUSEConnect -p sle-live-patching/12.4/x86_64 --regcode …
with the regcode for the latter from "SUSE Linux Enterprise Live Patching test subscription". Despite that command saying that it had registered the system successfully, both SUSEConnect --status-text as well as zypper migration claim that the live patching product is not registered, so this won't work. We could try an offline migration or just side-grade to Leap. Remote management with IPMI and over the web interface is working fine, so let's just go ahead with that. I am mostly following what I did in #117679 and #125534-6. First I deregistered the system and ensured I can access it over IPMI, though I did not see any response on ipmitool … sol activate.
I copied over a clean set of repo files out of a Leap 15.5 container with
mkdir -p ~/local/tmp/leap_15.5_repos && podman run --pull=newer --rm -it -v /home/okurz/local/tmp/leap_15.5_repos:/tmp/repos registry.opensuse.org/opensuse/leap:15.5 sh -c 'cp -a /etc/zypp/repos.d/* /tmp/repos/'
and then from my computer to backup.qam.suse.de with
rsync -aHP ~/local/tmp/leap_15.5_repos/ root@backup.qam.suse.de:/etc/zypp/repos.d.leap_15.5/
and on backup.qam.suse.de:
cd /etc/zypp/
mv repos.d repos.d.sle12sp4_old
mv repos.d.leap_15.5/ repos.d
export releasever=15.5
zypper --gpg-auto-import-keys --releasever=$releasever -n ref && zypper --releasever=$releasever -n dup --details --allow-vendor-change --allow-downgrade --replacefiles --auto-agree-with-licenses --download-in-advance && zypper --releasever=$releasever -n in --force openSUSE-release
After two retries this went fine. Then I rebooted and the machine came up again, but there was no grub or kernel output on the serial port.
I changed some settings, e.g. in /etc/default/grub from GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --parity=no" to GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200". Some systemd services failed, e.g. ypbind; I took over the config from login.suse.de. Also libvirt-related services failed. I assume those services are currently not needed, so I disabled them with systemctl disable --now libvirtd.service libvirtd-admin.socket libvirtd-ro.socket libvirtd.socket.
Regarding the RAID, it seems all devices are ok; maybe a device temporarily lost its connection. for i in /dev/sd? ; do smartctl --all $i | grep 'SMART overall-health'; done looks fine:
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: PASSED
So I just did mdadm --add /dev/md/backup /dev/sdd. Now the array is rebuilding, stated to take 180m. I will look back later.
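The rebuild progress can be polled from /proc/mdstat; a sketch extracting the percentage, assuming mdstat's usual recovery-line format (the sample line below is made up, not from this array):

```shell
# Sketch: extract the rebuild percentage from an mdstat recovery line.
# Sample line for illustration, in the format /proc/mdstat uses:
line='[=>...................]  recovery =  8.5% (166112256/1953382400) finish=178.3min speed=167108K/sec'
pct=$(echo "$line" | grep -o 'recovery = *[0-9.]*%' | grep -o '[0-9.]*%')
echo "rebuild at $pct"
```

In practice one would feed it grep recovery /proc/mdstat, or just run watch cat /proc/mdstat.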
Updated by okurz over 1 year ago
- Status changed from In Progress to Blocked
Fixed the serial output of the grub menu, the Linux kernel and the tty getty with entries in /etc/default/grub pointing to the 3rd serial port:
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS2,115200n8 noresume nospec crashkernel=222M-:111M showopts vga=0"
…
GRUB_SERIAL_COMMAND="serial --unit=2 --speed=115200"
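For completeness, the serial console only works when grub and the kernel agree on the same unit, and the edit must be baked into grub.cfg afterwards. A sketch of the full fragment (GRUB_TERMINAL is my assumption, not quoted from the machine; the grub2-mkconfig path is the usual openSUSE location):

```shell
# /etc/default/grub fragment: grub menu and kernel console on the
# 3rd serial port (ttyS2 == --unit=2, both at 115200)
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS2,115200n8 noresume nospec crashkernel=222M-:111M showopts vga=0"
GRUB_TERMINAL="console serial"   # assumption: enable both VGA and serial terminals
GRUB_SERIAL_COMMAND="serial --unit=2 --speed=115200"

# regenerate the boot config so the change takes effect (run as root):
grub2-mkconfig -o /boot/grub2/grub.cfg
```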
After another reboot the service rc-local.service failed because /dev/sdf vanished. Where did it go? It is still missing after another reboot. As noted in netbox, I think it is still a good idea to move the machine to FC Basement so that we can check it physically.
Waiting for #100455 to conclude which might shed some light on when we would have the machine in FC Basement.
Updated by okurz over 1 year ago
- Description updated (diff)
- Status changed from Blocked to Resolved
The machine was moved to FC Basement as part of #132614 and is up and running. https://netbox.suse.de/dcim/devices/6343/ is currently not in sync with racktables but https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9264 is up-to-date. I also added IP addresses in racktables.
I followed the salt-minion workaround from
https://progress.opensuse.org/projects/openqav3/wiki/#Network-legacy-boot-via-PXE-and-OSworker-setup and then https://gitlab.suse.de/openqa/salt-states-openqa#openqa-salt-states and applied a high state on the machine.
In /var/log/salt/minion I found
2023-08-04 21:38:13,990 [salt.state :319 ][ERROR ][22865] Unable to trigger watch for service.enabled
2023-08-04 21:39:40,148 [salt.minion :1988][WARNING ][27863] Passed invalid arguments to state.highstate: expected str, bytes or os.PathLike object, not list
Retrieve the state data from the salt master for this minion and execute it
One state failed to apply (225 succeeded) but I could not identify the failed one. Running again with sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1' multiple times yields a stable success, and all looks good on the machine now.
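For next time, the failed state's ID can be grepped out of captured highstate output; a sketch assuming salt's usual ID/Function/Name/Result block layout (the sample lines below are made up, not from the actual run):

```shell
# Sketch: list state IDs whose Result is False in salt highstate output.
# The variable holds a minimal illustrative sample of salt's block format.
out='          ID: backup_cron
    Function: cron.present
        Name: /usr/local/bin/backup.sh
      Result: False
          ID: sshd_service
    Function: service.running
        Name: sshd
      Result: True'
# Each block is ID/Function/Name/Result, so the ID sits 3 lines above Result
echo "$out" | grep -B3 'Result: False' | grep 'ID:'
```

On a real run one would pipe the salt command's output into a file first and grep that.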
Updated by livdywan over 1 year ago
- Copied to action #134453: backup.qam.suse.de is Failed according to netbox and not creating backups size:M added