action #170077

closed

coordination #161414: [epic] Improved salt based infrastructure management

Put more storage into qamaster "to make our lives easier in general" size:M

Added by okurz 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Start date:
2024-11-19
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Based on a suggestion by Nick Singer. We can check the physical slots of the machine and see if we have spare devices that would help us. okurz thinks we have some.

Acceptance criteria

  • AC1: Significantly more free space on qamaster

Suggestions

  • First ensure we have planned and executed proper backup, i.e. other ticket(s)
  • Migrate away production grade workload to more modern platforms, e.g. OpenPlatform, also in other tickets, wait for that
  • Be careful with the MegaRAID storage controller; you need to use /opt/MegaRAID/storcli/storcli64 and similar tools (see the example commands after this list)
  • Then look into using existing unused storage devices or put in new hardware and ensure data partitions can use it
  • "Significantly more free space" means much less than the current 82% usage of /dev/sdb2 which is used for /var/lib/libvirt/images

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #170026: [QA][tools][monitor] monitor.qa.suse.de is down (Resolved, okurz, 2024-11-18)

Copied to openQA Infrastructure (public) - action #173344: Extend iPXE in qe/oqa.*.suse.org to also display on local console size:S (Resolved, gpathak)

Copied to openQA Infrastructure (public) - action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S (Resolved, gpathak)

Actions #1

Updated by nicksinger 3 months ago

Details in racktables hint at a chassis with 8x 3.5" slots; the OS currently uses 3

Actions #2

Updated by okurz 3 months ago

I put one 2TB and one 1TB disk into qamaster. This might have broken some RAID. If you see problems feel free to trigger a reboot or power cycle. I won't have time today anymore myself.

Actions #4

Updated by okurz 3 months ago

qamaster has 12 (!) physical storage devices. In the OS we have a 600GB "sda" and a 4TB "sdb", but there does not seem to be a physical 4TB device, so I assume we have a hardware RAID0 or RAID5 or similar. Physically attached display+keyboard. Booting without devices to understand which are connected to an internal storage controller and how this works. "Entering setup…" stays there for a long time, from 12:39Z until 12:41Z, so don't be surprised that it takes 3m to reach the BIOS. The BIOS SATA Configuration says it has 6 ports, port 0 through 5, with AHCI mode and hot plug enabled for all. Then there is a page "SCU Configuration" where the "Storage Controller Unit" was disabled. Now enabled. Ports 0 through 7, all "not present". Also enabled "EMS Console Redirection", "Out-of-Band Mgmt Port COM2/SOL". Maybe we can see more over IPMI. On IPMI SOL I could see the BIOS screen but maybe we had that already before.

I plugged a drive into storage slot 0, see a blue LED, but both SATA and SCU port 0 show "not present". Also plugged one into slot 11, see a red LED, still "not present". Restarting the system and entering setup again. Still no slots show up. Exited setup and even after 5m of waiting the screen outside setup is just black and the machine does not respond. I entered setup again, disabled the SCU controller again and rebooted (13:14Z); 13:20Z, still not up. Disabled "EMS Console Redirection" now, 13:23Z. Also the machine beeps, like in #114893.

The two new devices are currently not connected. The priority now is to bring the machine back as-is. For now I put the two trays with the new disks into the storage cabinet. No luck bringing the machine back up so far.
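For reference, the IPMI SoL session mentioned above is typically attached with ipmitool, roughly like this (a sketch; the BMC address and credentials are placeholders, not taken from this ticket):

# attach to the serial-over-LAN console of the qamaster BMC
ipmitool -I lanplus -H <qamaster-bmc> -U <user> -P <password> sol activate
# if the machine is stuck, a remote power cycle can be triggered the same way
ipmitool -I lanplus -H <qamaster-bmc> -U <user> -P <password> chassis power cycle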

Actions #5

Updated by okurz 3 months ago

  • Related to action #170026: [QA][tools][monitor] monitor.qa.suse.de is down added
Actions #6

Updated by okurz 3 months ago

  • Priority changed from Normal to Urgent
Actions #7

Updated by okurz 3 months ago · Edited

  • Status changed from New to In Progress
  • Priority changed from Urgent to High

On boot I could press ctrl-h and reach the RAID controller firmware menu. From there I found "foreign config" for DG0 and DG1 but "unconfigured" for slots 10+11, which are also the ones showing up with a red light. At least booting to the root disk should work with this. Exited, rebooted, entered again, verified the config is still valid up to this point. But after another reboot I still cannot boot from the local disk. nicksinger has enabled network boot and PXE+EFI. The system ends up in the EFI shell; one needs to exit with the command "exit". The network boot with DHCP showed up many times. Eventually I booted a Tumbleweed system with ttyS1 as console, which showed output on IPMI SoL.
The good thing is: I'm in a live Tumbleweed system and can confirm that both the root partition as well as the VM data partition are fully usable. Interestingly we have two 2TB disks which were apparently configured for RAID0 but not used in the past years(?). The system does not boot up yet but I am relieved so far.
Reconfigured boot settings with nicksinger. The system came up again. VMs are running. So at least we recovered up to that point. Things to do:

  1. iPXE should also display on the local console, e.g. add console=tty1 console=ttyS1 (or the other way around) to also show something on the local screen, not just remotely (see the sketch after this list) -> #173344
  2. Create backup of backup and VMs, config, jenkins, etc. -> #173347
  3. Migrate VMs to modern hypervisor solution, e.g. openplatform -> #173350
  4. Physically label slot 10+11 -> #173353
  5. Bring slots 10+11 into use, maybe at best as software RAID0 or RAID1, not hardware RAID (see the mdadm sketch after this list)
  6. Check if the "EMS console redirection" setting helps us, e.g. to mirror more output to both the physical monitor and SoL
  7. Document that KVMViewer can output VGA whereas IPMI SoL only serial (is that right?) -> https://gitlab.suse.de/suse/wiki/-/merge_requests/5
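For item 1 the change boils down to additional kernel console parameters in the iPXE boot entry, roughly like this (a sketch with an assumed baud rate; the actual change is tracked in #173344):

# appended to the kernel command line of the iPXE entry; the last console= becomes /dev/console
console=ttyS1,115200 console=tty1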
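For item 5 a software RAID over the two new disks could look roughly like the following (a sketch only; the device names are hypothetical and assume both disks get passed through individually, which differs from the setup eventually used in this ticket):

# mirror the two 2TB disks with mdadm and put a filesystem on top (RAID1 shown; RAID0 would use --level=0)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0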
Actions #8

Updated by openqa_review 3 months ago

  • Due date set to 2024-12-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by okurz 3 months ago

  • Status changed from In Progress to New
Actions #10

Updated by okurz 3 months ago

  • Subject changed from Put more storage into qamaster "to make our lives easier in general" to Put more storage into qamaster "to make our lives easier in general" size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Parent task set to #161414
Actions #11

Updated by okurz 3 months ago

  • Copied to action #173344: Extend iPXE in qe/oqa.*.suse.org to also display on local console size:S added
Actions #12

Updated by okurz 3 months ago

  • Copied to action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Actions #13

Updated by okurz 3 months ago

  • Due date deleted (2024-12-07)
  • Status changed from Workable to Blocked
  • Priority changed from High to Normal
Actions #14

Updated by okurz 3 months ago

Backup was done. I would like us to work on #173344 first, which can really save some precious minutes when hot issues needing local intervention arise. Added #173344 to the backlog.

Actions #15

Updated by livdywan about 2 months ago

  • Status changed from Blocked to Workable

Looks to me like all relevant blockers were resolved.

Actions #16

Updated by okurz about 2 months ago

/opt/MegaRAID/storcli/storcli64 /c0 show shows

PD LIST :
=======

------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model            Sp Type 
------------------------------------------------------------------------------
0:0       6 Onln   0 278.875 GB SAS  HDD N   N  512B ST9300605SS      U  -    
0:1       3 Onln   0 278.875 GB SAS  HDD N   N  512B ST9300605SS      U  -    
0:2       1 Onln   0 278.875 GB SAS  HDD N   N  512B ST9300605SS      U  -    
0:3       5 Onln   0 278.875 GB SAS  HDD N   N  512B ST9300605SS      U  -    
0:4      14 Onln   1   931.0 GB SATA HDD Y   N  512B ST91000642NS     U  -    
0:5       4 Onln   1   931.0 GB SAS  HDD N   N  512B ST91000640SS     U  -    
0:6       8 Onln   1   931.0 GB SAS  HDD N   N  512B ST91000640SS     U  -    
0:7      10 Onln   1   931.0 GB SAS  HDD N   N  512B ST91000640SS     U  -    
0:8      13 Onln   1   931.0 GB SAS  HDD Y   N  512B ST91000642SS     U  -    
0:9       7 Onln   1   931.0 GB SAS  HDD N   N  512B ST91000640SS     U  -    
0:10     11 UBad   -   1.818 TB SATA HDD N   N  4 KB ST2000NX0243     U  -    
0:11     12 UBad   -   1.818 TB SATA HDD N   N  4 KB ST2000NX0243     U  -    
------------------------------------------------------------------------------

Note the last two devices in state "UBad". I suggest configuring both of those devices to be passed through individually to the OS and using both for more space.
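A rough sketch of that passthrough approach, assuming the controller firmware supports JBOD mode and that the drives are first brought back from the UBad state (which is what the next comments do):

sudo /opt/MegaRAID/storcli/storcli64 /c0 set jbod=on
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s10 set jbod
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s11 set jbod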

Actions #17

Updated by okurz about 1 month ago

  • Status changed from Workable to In Progress
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s10 set good
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /s11 set good
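This should flip both UBad drives to "Unconfigured Good"; as a sanity check (a hypothetical follow-up, not from the original log) they should then show up as "UGood" in

sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /sall show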
Actions #18

Updated by okurz about 1 month ago

  • Status changed from In Progress to Resolved
sudo /opt/MegaRAID/storcli/storcli64 /c0 /fall import
sudo /opt/MegaRAID/storcli/storcli64 /c0 /e0 /sall show rebuild

shows all as "Not in progress"
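(/fall import imports any foreign configuration found on the drives; "show rebuild" confirms that no rebuild is currently running on any slot.)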

lsblk shows a free 3.6T /dev/sdc. I checked the filesystem on /dev/sdc1 and it was a broken XFS filesystem, likely not recoverable but also not important. I removed the partition, created a btrfs filesystem and added a corresponding mount point with

mkfs.btrfs -f /dev/sdc
mkdir -p /srv/storage
echo 'UUID=71acb0e1-9dc8-40a8-a539-226053a33c4d /srv/storage btrfs defaults 0 0' >> /etc/fstab
mount -a

so an additional 3.7TB of free space is available in /srv/storage. It's a not-so-safe RAID0 of rather old disks, so I advise to use it only as a backup target or for non-critical data.
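As a sanity check (not part of the original commands) the fstab UUID and the resulting free space can be verified with something like

blkid /dev/sdc
df -h /srv/storage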
