action #168301

open

coordination #162716: [epic] Better use of storage on OSD workers

After w40 related problems reconsider storage use for all PRG2 based OSD workers

Added by okurz 2 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-06-21
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

See #162719 and #162725. It seems that what was overlooked in #162719 is that w40 probably temporarily lost the connection to one of its NVMes. Right now lsblk shows:

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme1n1     259:0    0 476.9G  0 disk  
└─md127       9:127  0   6.3T  0 raid0 /var/lib/openqa
nvme2n1     259:1    0 476.9G  0 disk  
├─nvme2n1p1 259:2    0   512M  0 part  /boot/efi
├─nvme2n1p2 259:3    0   293G  0 part  /
…
nvme0n1     259:6    0   5.8T  0 disk  
└─md127       9:127  0   6.3T  0 raid0 /var/lib/openqa

so a RAID0 is constructed between a 500GiB and a 6TiB device which does not make much sense to me. So after all I think the approach from dheidler in #162719 was not enough as it was not addressing the real problem.
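To make the concern about mixing a small and a large member more concrete: as far as we understand, Linux md raid0 with unequal members splits the array into zones. The first zone is striped across all members up to the size of the smallest one, and the remainder of the larger member is used on its own. The following Python sketch computes that zone layout under a simplified model (chunk rounding and metadata ignored), using the member sizes from the lsblk output above (lsblk's G/T suffixes are GiB/TiB).

```python
from typing import List, Tuple


def raid0_zones(sizes_gib: List[float]) -> List[Tuple[int, float]]:
    """Return (stripe_width, zone_size_gib) for each zone of a RAID0.

    Simplified model of Linux md raid0 with unequal members: the first zone
    stripes across all members up to the size of the smallest one, the next
    zone across the members that still have space left, and so on.
    Chunk rounding and metadata are ignored.
    """
    members = sorted(sizes_gib)
    zones: List[Tuple[int, float]] = []
    consumed = 0.0  # space already used on every member still in play
    for index, size in enumerate(members):
        width = len(members) - index  # this member plus all larger ones (sorted order)
        per_member = size - consumed
        if per_member > 0:
            zones.append((width, width * per_member))
        consumed = size
    return zones


if __name__ == "__main__":
    # w40 members per lsblk: 476.9G and 5.8T (lsblk uses GiB/TiB)
    zones = raid0_zones([476.9, 5.8 * 1024])
    total = sum(size for _, size in zones)
    for width, size in zones:
        print(f"striped over {width} device(s): {size:.1f} GiB ({size / total:.0%})")
    print(f"total: {total / 1024:.2f} TiB")
```

For the w40 sizes this reports a zone of about 954 GiB striped over both devices (about 15% of the array) and about 5462 GiB on nvme0n1 alone (about 85%), for a total of about 6.27 TiB, consistent with the 6.3T lsblk shows for md127. So most of /var/lib/openqa gets no striping benefit, while, as with any RAID0, losing either device loses the whole array.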

Acceptance criteria

  • AC1: All PRG2 x86_64 OSD workers use their existing local storage more efficiently

Acceptance tests

  • AT1-1: ssh osd sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' cmd.run 'df -h /var/lib/openqa' shows a size > 1TB for /var/lib/openqa on all PRG2 workers
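Checking the AT1-1 output by eye across many workers is error-prone. A minimal Python sketch that parses the salt output and lists the workers whose /var/lib/openqa is not larger than 1 TB could look like the following. It assumes salt's default outputter layout (an unindented "minion-id:" line followed by that minion's indented df output) and df's binary size suffixes; the minion names in the sample are hypothetical.

```python
import re
from typing import Dict, List, Optional

# df -h prints binary multiples: K = 1024, M = 1024**2, ...
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def parse_size(text: str) -> float:
    """Convert a df -h size such as '6.3T' or '477G' into bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGTP]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def openqa_sizes(salt_output: str, mount: str = "/var/lib/openqa") -> Dict[str, float]:
    """Map each minion id to the size of `mount` in bytes.

    Assumes salt's default outputter: an unindented 'minion-id:' line
    followed by that minion's indented df output.
    """
    sizes: Dict[str, float] = {}
    minion: Optional[str] = None
    for line in salt_output.splitlines():
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            minion = line.rstrip()[:-1]
            continue
        fields = line.split()
        # df data row: Filesystem Size Used Avail Use% Mounted-on
        if minion is not None and len(fields) == 6 and fields[5] == mount:
            sizes[minion] = parse_size(fields[1])
    return sizes


def too_small(sizes: Dict[str, float], threshold: float = 10**12) -> List[str]:
    """Minions whose mount is not larger than `threshold` bytes (1 TB by default)."""
    return sorted(name for name, size in sizes.items() if size <= threshold)


if __name__ == "__main__":
    # Hypothetical sample output; real minion ids will differ.
    sample = """\
worker-a.example:
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/md127      6.3T  1.2T  5.1T  19% /var/lib/openqa
worker-b.example:
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme0n1    477G  300G  177G  63% /var/lib/openqa
"""
    print(too_small(openqa_sizes(sample)))
```

For the sample this prints ['worker-b.example']. The default threshold follows the "1TB" wording of AT1-1 (10**12 bytes); passing 1024**4 instead would check for 1 TiB.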

Suggestions

Out of scope

Ordering any new physical storage devices


Related issues: 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #162725: After w40 reconsider storage use for other OSD workers size:S (Resolved, gpathak, 2024-06-21)

#1

Updated by okurz 2 months ago

  • Copied from action #162725: After w40 reconsider storage use for other OSD workers size:S added
#2

Updated by gpathak 2 months ago · Edited

okurz wrote:

so a RAID0 is constructed between a 500GiB and a 6TiB device which does not make much sense to me. So after all I think the approach from dheidler in #162719 was not enough as it was not addressing the real problem.

It seems we aim to move the root filesystem on all other workers to the smallest storage device (e.g. on w40 either nvme1n1 or nvme2n1, both 476.9 GiB) while keeping /var/lib/openqa on the largest available storage (either nvme0n1 alone or the combination of nvme1n1 and nvme0n1). If this is correct, can we update or extend the acceptance criteria for clarity?
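The layout rule proposed in this comment could be expressed as a small Python sketch: parse lsblk's JSON output, put / on the smallest disk and hand the remaining disks, largest first, to /var/lib/openqa. The tie-break (prefer the disk that already holds /, then the name) and the byte counts in the example are assumptions for illustration, not part of the ticket.

```python
import json
from typing import Dict, List, Tuple


def disk_sizes(lsblk_json: str) -> Dict[str, int]:
    """Parse `lsblk --json --bytes --nodeps -o NAME,SIZE,TYPE` into {name: bytes}."""
    data = json.loads(lsblk_json)
    # Depending on the util-linux version, SIZE is a JSON number or a string.
    return {dev["name"]: int(dev["size"]) for dev in data["blockdevices"] if dev["type"] == "disk"}


def propose_layout(disks: Dict[str, int], current_root: str = "") -> Tuple[str, List[str]]:
    """Return (disk for /, disks for /var/lib/openqa, largest first).

    Hypothetical rule based on the proposal in this ticket: / goes to the
    smallest disk; on a size tie the disk that already holds / wins, then the name.
    """
    root = min(disks, key=lambda name: (disks[name], name != current_root, name))
    rest = sorted((name for name in disks if name != root), key=lambda name: disks[name], reverse=True)
    return root, rest


if __name__ == "__main__":
    # Illustrative byte counts roughly matching lsblk's 476.9G and 5.8T on w40.
    sample = json.dumps({"blockdevices": [
        {"name": "nvme1n1", "size": 512110190592, "type": "disk"},
        {"name": "nvme2n1", "size": 512110190592, "type": "disk"},
        {"name": "nvme0n1", "size": 6401252745216, "type": "disk"},
    ]})
    print(propose_layout(disk_sizes(sample), current_root="nvme2n1"))
```

With root currently on nvme2n1 this keeps / where it is and proposes nvme0n1 and nvme1n1 for /var/lib/openqa; whether those are combined or nvme0n1 is used alone is exactly the open question raised above.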
