action #170077
closed coordination #161414: [epic] Improved salt based infrastructure management
Put more storage into qamaster "to make our lives easier in general" size:M
Description
Motivation
Based on a suggestion by Nick Singer. We can check the physical slots of the machine and see if we have spare devices that would help us. okurz thinks we have some.
Acceptance criteria
- AC1: Significantly more free space on qamaster
Suggestions
- First ensure we have planned and executed proper backup, i.e. other ticket(s)
- Migrate production-grade workloads away to more modern platforms, e.g. OpenPlatform, also in other tickets; wait for that
- Be careful with the MegaRAID storage controller as you need to use /opt/MegaRAID/storcli/storcli64 and such (see the sketch after this list)
- Then look into using existing unused storage devices or put in new hardware and ensure data partitions can use it
- "Significantly more free space" means much less than the current 82% usage of /dev/sdb2, which is used for /var/lib/libvirt/images
Updated by nicksinger 3 months ago
Details in racktable hint at a chassis with 8x 3.5" slots while the OS currently uses 3.
Updated by okurz 3 months ago
qamaster has 12 (!) physical storage devices. In the OS we have a 600GB "sda" and a 4TB "sdb" but there does not seem to be a physical 4TB device, so I assume we have a hardware RAID0 or RAID5 or similar. Physically attached display+keyboard. Booting without the new devices to understand which slots are connected to an internal storage controller and how this works. "Entering setup…" stays there for a long time, from 12:39Z until 12:41Z, so don't be surprised that it takes ~3m to reach the BIOS.
The BIOS SATA Configuration says it has 6 ports, port 0 through 5, with AHCI mode and hot plug enabled for all. Then there is a page "SCU Configuration" where "Storage Controller Unit" was disabled; now enabled. Port 0 through 7, all "not present". Also enabled "EMS Console Redirection" and "Out-of-Band Mgmt Port COM2/SOL"; maybe I can see more over IPMI. On IPMI SoL I could see the BIOS screen, but maybe we had that already before. I plugged a disk into storage slot 0, saw a blue LED, but both SATA and SCU port 0 show "not present". Also plugged one into port 11, saw a red LED, still "not present". Restarted the system and entered setup again; still no slots show up.
Exited setup, and even after 5m of waiting the screen outside setup is just black and the machine does not respond. I entered setup again, disabled the SCU controller again and rebooted (13:14Z); 13:20Z still not up. Disabled "EMS Console Redirection" now, 13:23Z. Also the machine beeps, like in #114893. The two new devices are currently not connected. Now the priority is to bring back the machine as-is. The two trays with the new disks I put into the storage cabinet for now. No luck bringing the machine back up so far.
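As a reference for the IPMI SoL access used here, a hedged sketch with ipmitool (BMC hostname and credentials are placeholders, not values from this ticket):
# attach to Serial-over-LAN to follow BIOS and bootloader output remotely
ipmitool -I lanplus -H <qamaster-bmc> -U <user> -P <password> sol activate
# tear down a stale SoL session if the console seems stuck
ipmitool -I lanplus -H <qamaster-bmc> -U <user> -P <password> sol deactivate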
Updated by okurz 3 months ago
- Related to action #170026: [QA][tools][monitor] monitor.qa.suse.de is down added
Updated by okurz 3 months ago · Edited
- Status changed from New to In Progress
- Priority changed from Urgent to High
On boot I could press ctrl-h and reach the RAID controller firmware menu. From there I found "foreign config" for DG0 and DG1 but "unconfigured" for slots 10+11, which are also the ones showing up with a red light. At least boot to root should work with this. Exited, rebooted, entered again, verified a valid config up to this point. But after another reboot I still can not boot from the local disk. nicksinger has enabled network boot and PXE+EFI. The system ends up in the EFI shell and needs to be exited with the command "exit". The network boot with DHCP showed up many times. Eventually I booted a Tumbleweed system with ttyS1, which showed output on IPMI SoL.
The good thing is: I'm in a live Tumbleweed system and can confirm that both the root partition and the VM data partition are fully usable. Interestingly, we have two 2TB disks which were apparently configured as RAID0 but not used in the past years(?). The system does not boot up yet, but I am relieved so far.
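Purely as an illustration of the kind of check done here (the root partition device is an assumption; /dev/sdb2 for the VM images is taken from the description above), inspecting the data from a live system could look like:
lsblk -f                          # list block devices and detected filesystems
mkdir -p /mnt/root /mnt/images
mount -o ro /dev/sda2 /mnt/root   # root partition, mounted read-only for inspection
mount -o ro /dev/sdb2 /mnt/images # VM data partition (/var/lib/libvirt/images)
ls /mnt/images                    # confirm the VM images are readable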
Reconfigured boot settings with nicksinger. System came up again. VMs are running. So at least recovered up to that point. Things to do:
- iPXE should also display on the local console, e.g. add
console=tty1 console=ttyS1
(or the other way around) so that something is also shown on the local screen, not just remotely -> #173344
- Create a backup of the backup and VMs, config, jenkins, etc. -> #173347
- Migrate VMs to modern hypervisor solution, e.g. openplatform -> #173350
- Physically label slot 10+11 -> #173353
- Bring slots 10+11 into use, maybe best as software RAID0 or RAID1, not hardware RAID (see the sketch after this list)
- Check if the "console redirection EMS" setting helps us, e.g. to mirror more output to the physical monitor and SoL
- Document that KVMViewer can output VGA whereas IPMI SoL only shows serial (is that right?) -> https://gitlab.suse.de/suse/wiki/-/merge_requests/5
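For the software RAID item above, a rough sketch, assuming the two 2TB disks end up exposed to the OS as /dev/sdc and /dev/sdd (the device names and the RAID1 choice are assumptions):
# create a mirror from the two disks and put a filesystem on it
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0
# persist the array so it is assembled on boot
mdadm --detail --scan >> /etc/mdadm.conf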
Updated by openqa_review 3 months ago
- Due date set to 2024-12-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Copied to action #173344: Extend iPXE in qe/oqa.*.suse.org to also display on local console size:S added
Updated by okurz 3 months ago
- Copied to action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Updated by livdywan about 2 months ago
- Status changed from Blocked to Workable
Looks to me like all relevant blockers were resolved.
Updated by okurz about 2 months ago
/opt/MegaRAID/storcli/storcli64 /c0 show
shows
PD LIST :
=======
------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
------------------------------------------------------------------------------
0:0 6 Onln 0 278.875 GB SAS HDD N N 512B ST9300605SS U -
0:1 3 Onln 0 278.875 GB SAS HDD N N 512B ST9300605SS U -
0:2 1 Onln 0 278.875 GB SAS HDD N N 512B ST9300605SS U -
0:3 5 Onln 0 278.875 GB SAS HDD N N 512B ST9300605SS U -
0:4 14 Onln 1 931.0 GB SATA HDD Y N 512B ST91000642NS U -
0:5 4 Onln 1 931.0 GB SAS HDD N N 512B ST91000640SS U -
0:6 8 Onln 1 931.0 GB SAS HDD N N 512B ST91000640SS U -
0:7 10 Onln 1 931.0 GB SAS HDD N N 512B ST91000640SS U -
0:8 13 Onln 1 931.0 GB SAS HDD Y N 512B ST91000642SS U -
0:9 7 Onln 1 931.0 GB SAS HDD N N 512B ST91000640SS U -
0:10 11 UBad - 1.818 TB SATA HDD N N 4 KB ST2000NX0243 U -
0:11 12 UBad - 1.818 TB SATA HDD N N 4 KB ST2000NX0243 U -
------------------------------------------------------------------------------
note the last two devices in state "UBad". I suggest configuring both of those devices to be passed through individually to the OS and using them for more space.
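A hedged sketch of what such an individual pass-through could look like with storcli64, assuming the controller firmware supports JBOD mode at all (the exact syntax differs between storcli versions, so treat this as illustrative only):
# enable JBOD support on the controller, then expose both drives directly to the OS
/opt/MegaRAID/storcli/storcli64 /c0 set jbod=on
/opt/MegaRAID/storcli/storcli64 /c0 /e0 /s10 set jbod
/opt/MegaRAID/storcli/storcli64 /c0 /e0 /s11 set jbod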
Updated by okurz about 1 month ago
- Status changed from Workable to In Progress
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s10 set good
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s11 set good
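To verify the change, something along these lines should show the drives as "UGood" and reveal any foreign configuration they still carry (a sketch using only show commands):
/opt/MegaRAID/storcli/storcli64 /c0 /e0 /sall show   # drive state should now be UGood instead of UBad
/opt/MegaRAID/storcli/storcli64 /c0 /fall show       # list foreign configurations before importing them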
Updated by okurz about 1 month ago
- Status changed from In Progress to Resolved
sudo /opt/MegaRAID/storcli/storcli64 /c0 /fall import
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /sall show rebuild
shows all as "Not in progress"
lsblk
shows a free 3.6T in /dev/sdc. I checked the filesystem on /dev/sdc1 and it was a broken XFS filesystem, likely not recoverable but also not important. I removed the partition, created a btrfs filesystem and added a corresponding mountpoint with
mkfs.btrfs -f /dev/sdc
mkdir -p /srv/storage
echo 'UUID=71acb0e1-9dc8-40a8-a539-226053a33c4d /srv/storage btrfs defaults 0 0' >> /etc/fstab
mount -a
so an additional 3.7TB of free space is available in /srv/storage. It is a not-so-safe RAID0 of rather old disks, so I advise using it only as a backup target or for non-critical data.
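For completeness, a small sketch of how the UUID for the fstab entry can be looked up and the new mount verified (standard commands, nothing qamaster-specific):
blkid /dev/sdc                        # prints the UUID referenced in /etc/fstab
df -h /srv/storage                    # confirm size and free space of the new mount
btrfs filesystem show /srv/storage    # show the backing device and allocation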