action #162719

closed

coordination #162716: [epic] Better use of storage on OSD workers

Ensure w40 has more space for worker pool directories size:S

Added by okurz 26 days ago. Updated about 2 hours ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2024-06-21
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

w40 ran out of space in /var/lib/openqa despite another partition having multiple TB of free space. We should reconsider the choices we made when setting up the OSD PRG2 workers.

# lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:1    0   5.8T  0 disk  
├─nvme0n1p1 259:2    0   512M  0 part  /boot/efi
├─nvme0n1p2 259:3    0   5.8T  0 part  /var
…
│                                      /
└─nvme0n1p3 259:4    0     1G  0 part  [SWAP]
nvme2n1     259:5    0 476.9G  0 disk  
└─md127       9:127  0 476.8G  0 raid0 /var/lib/openqa
# hdparm -tT /dev/nvme?n1

/dev/nvme0n1:
 Timing cached reads:   30178 MB in  1.99 seconds = 15202.23 MB/sec
 Timing buffered disk reads: 6360 MB in  3.00 seconds = 2120.00 MB/sec

/dev/nvme2n1:
 Timing cached reads:   33204 MB in  1.98 seconds = 16739.11 MB/sec
 Timing buffered disk reads: 8478 MB in  3.00 seconds = 2825.74 MB/sec

nvme2n1 is roughly 33% faster for buffered reads but far more limited in space.
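The ratio can be double-checked from the buffered-read figures above (a throwaway calculation, not part of the ticket):

```shell
# Buffered disk reads: nvme2n1 at 2825.74 MB/s vs nvme0n1 at 2120.00 MB/s
awk 'BEGIN { printf "nvme2n1 is %.0f%% faster\n", (2825.74 / 2120.00 - 1) * 100 }'
# prints: nvme2n1 is 33% faster
```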

Acceptance criteria

  • AC1: w40 has significantly more space than 500G for pool+cache combined

Suggestions

Out of scope

Rollback steps


Related issues (2 open, 0 closed)

Related to openQA Infrastructure - action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S (Blocked, okurz, 2024-06-20)

Copied to openQA Infrastructure - action #162725: After w40 reconsider storage use for other OSD workers (New, 2024-06-21)
Actions #1

Updated by okurz 26 days ago

  • Tracker changed from coordination to action
  • Project changed from QA to openQA Infrastructure
Actions #2

Updated by okurz 26 days ago

  • Copied to action #162725: After w40 reconsider storage use for other OSD workers added
Actions #3

Updated by okurz 22 days ago

  • Description updated (diff)
Actions #4

Updated by okurz 14 days ago

  • Target version changed from future to Ready
Actions #5

Updated by okurz 14 days ago

  • Subject changed from Ensure w40 has more space for worker pool directories to Ensure w40 has more space for worker pool directories size:S
  • Description updated (diff)
  • Category set to Feature requests
  • Status changed from New to Workable
Actions #6

Updated by okurz 14 days ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

Because w40 is critical we should block on #158146.

Actions #7

Updated by livdywan 12 days ago

  • Status changed from Blocked to Workable

No longer blocked

Actions #8

Updated by okurz 12 days ago

  • Assignee deleted (okurz)
Actions #9

Updated by okurz 11 days ago

  • Related to action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S added
Actions #10

Updated by nicksinger 8 days ago

  • Assignee set to nicksinger
Actions #11

Updated by nicksinger 5 days ago

  • Assignee deleted (nicksinger)

I'm not currently working on it

Actions #12

Updated by livdywan 2 days ago

  • Priority changed from Normal to High

This is blocking #162596, which is High priority, so this one needs to be High as well.

Actions #13

Updated by dheidler 1 day ago

  • Assignee set to dheidler
Actions #14

Updated by dheidler 1 day ago

  • Status changed from Workable to In Progress
Actions #15

Updated by dheidler about 24 hours ago · Edited

(Using sda for the old disk and sdb for the new disk holding the root fs here, as it is shorter.)

This describes how to move the root fs to the smaller disk (here sdb).
The script from salt will automatically use the other disk for /var/lib/openqa.

  • Be aware that partition UUIDs and filesystem UUIDs differ; use blkid to view both.
  • Unmount /var/lib/openqa.
  • Online-resize (shrink) the existing btrfs filesystem and its partition so they fit the new disk.
  • Copy over the data using dd.
  • Copy over the GPT table using sgdisk (e.g. sgdisk /dev/sda -R /dev/sdb).
  • Randomize the disk and partition GUIDs on the new disk (sgdisk -G /dev/sdb).
  • Generate a new UUID for the new btrfs filesystem: btrfstune -u /dev/sdb2
  • Generate a new UUID for the new vfat EFI partition: mlabel -s -n :: -i /dev/sdb1
  • Deal with the swap partition (make sure it ends up in the right place on the new disk).
  • mount /dev/sdb2 /mnt
  • Replace the old btrfs UUID with the new one in /mnt/etc/fstab and /mnt/boot/grub/grub.cfg.
  • umount /boot/efi
  • Mount the new EFI partition: mount /dev/sdb1 /boot/efi
  • Replace the old btrfs UUID with the new one in /boot/efi/EFI/opensuse/grub.cfg.
  • Update the bootloader entry in the EFI variables: update-bootloader --install
  • Make sure the right boot partition is set in EFI using bootctl and efibootmgr -v.
  • If needed, remove the old boot entry: efibootmgr --delete -b XXXX
  • Reboot.
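The steps above could be sketched roughly as follows. This is a hypothetical dry-run sketch, not the script actually used: the device names, the 400G shrink target, and the command ordering (GPT copied before dd so the target partition exists) are assumptions, and the manual fstab/grub UUID edits described above are only noted as a comment.

```shell
#!/bin/sh
# Hypothetical sketch of the migration steps above (old=/dev/sda, new=/dev/sdb).
# DRY_RUN=1 (the default) only prints each command instead of executing it.
set -eu
OLD=/dev/sda
NEW=/dev/sdb
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run umount /var/lib/openqa
run btrfs filesystem resize 400G /   # shrink target size is an assumption
run sgdisk "$OLD" -R "$NEW"          # copy the GPT first so ${NEW}2 exists
run sgdisk -G "$NEW"                 # randomize disk and partition GUIDs
run dd if="${OLD}2" of="${NEW}2" bs=64M status=progress
run btrfstune -u "${NEW}2"           # new UUID for the btrfs filesystem
run mlabel -s -n :: -i "${NEW}1"     # new UUID for the vfat EFI partition
run mount "${NEW}2" /mnt
# Manual steps omitted here: replace the old btrfs UUID in /mnt/etc/fstab,
# /mnt/boot/grub/grub.cfg and /boot/efi/EFI/opensuse/grub.cfg, then:
run update-bootloader --install
```

Run with DRY_RUN=0 only after verifying every device path and UUID by hand.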
Actions #16

Updated by dheidler about 24 hours ago

  • Status changed from In Progress to Feedback

Increased the number of worker slots on w40 again:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/866

Actions #17

Updated by dheidler about 2 hours ago

  • Status changed from Feedback to Resolved