action #162725
Closed · Parent: coordination #162716: [epic] Better use of storage on OSD workers
After w40 reconsider storage use for other OSD workers size:S
Description
Motivation
See #162719
Acceptance criteria
- AC1: All PRG2 x86_64 OSD workers have significantly more space than 500G for pool+cache combined using the existing physical storage devices
Acceptance tests
- AT1-1:
ssh osd sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' cmd.run 'df -h /var/lib/openqa'
shows > 500G for all PRG2 workers
Suggestions
- Review what was done in #162719-15 manually and consider changing the mount points using the salt states in https://gitlab.suse.de/openqa/salt-states-openqa/-/tree/master/openqa/nvme_store?ref_type=heads which prepare the devices accordingly (see the sketch below). If that is not feasible, apply the same change manually on all machines
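A minimal sketch of the two options, assuming the nvme_store states are part of the regular highstate and that re-applying it recreates the RAID and mount layout; device names and the filesystem type are assumptions to verify against the actual state files before running anything:
# Option 1: let salt re-apply the prepared states on the affected workers
ssh osd "sudo salt -C 'G@roles:worker and G@osarch:x86_64' state.apply"
# Option 2: manual fallback on a single worker (example device names,
# check lsblk first; this destroys the existing array and its data)
# mdadm --stop /dev/md127
# mdadm --create /dev/md/openqa --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
# mkfs.ext2 /dev/md/openqa
# mount /dev/md/openqa /var/lib/openqa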
Out of scope
Ordering any new physical storage devices
Updated by okurz 6 months ago
- Copied from action #162719: Ensure w40 has more space for worker pool directories size:S added
Updated by gpathak 2 months ago
Seems like the AC and AT are already fulfilled:
openqaworker17.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 617G 2.7T 19% /var/lib/openqa
openqaworker18.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 618G 2.7T 19% /var/lib/openqa
openqaworker16.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 611G 2.7T 19% /var/lib/openqa
worker36.oqa.prg2.suse.org:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 57G 835G 7% /var/lib/openqa
worker39.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 55G 837G 7% /var/lib/openqa
worker32.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 55G 836G 7% /var/lib/openqa
worker30.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 62G 830G 7% /var/lib/openqa
qesapworker-prg5.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 65G 14T 1% /var/lib/openqa
worker33.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 62G 830G 7% /var/lib/openqa
worker31.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 65G 827G 8% /var/lib/openqa
worker29.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 79G 813G 9% /var/lib/openqa
qesapworker-prg4.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 66G 14T 1% /var/lib/openqa
openqaworker14.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 2.5T 582G 1.8T 25% /var/lib/openqa
qesapworker-prg7.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 66G 14T 1% /var/lib/openqa
worker40.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 6.3T 55G 5.9T 1% /var/lib/openqa
worker35.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 59G 832G 7% /var/lib/openqa
worker34.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 61G 830G 7% /var/lib/openqa
qesapworker-prg6.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 59G 14T 1% /var/lib/openqa
sapworker1.qe.nue2.suse.org:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 62G 14T 1% /var/lib/openqa
The above output is from the command sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' cmd.run 'df -h /var/lib/openqa' executed on OSD.
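For reference, a check restricted to just the PRG2 machines from AC1 could look like this, assuming the salt minion IDs match the FQDNs shown above:
# only x86_64 workers whose minion ID is in oqa.prg2.suse.org
ssh osd "sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64 and *.oqa.prg2.suse.org' cmd.run 'df -h /var/lib/openqa'"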
Updated by okurz 2 months ago
- Copied to action #168301: After w40 related problems reconsider storage use for all PRG2 based OSD workers added
Updated by okurz 2 months ago
Good check! Yeah, that's true. It seems what was overlooked as part of #162719 is that w40 probably temporarily lost the connection to one of its NVMes. Right now lsblk shows
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme1n1 259:0 0 476.9G 0 disk
└─md127 9:127 0 6.3T 0 raid0 /var/lib/openqa
nvme2n1 259:1 0 476.9G 0 disk
├─nvme2n1p1 259:2 0 512M 0 part /boot/efi
├─nvme2n1p2 259:3 0 293G 0 part /
…
nvme0n1 259:6 0 5.8T 0 disk
└─md127 9:127 0 6.3T 0 raid0 /var/lib/openqa
so a RAID0 is constructed from a 500GiB and a 6TiB device, which does not make much sense to me. So after all I think the approach from dheidler in #162719 was not enough, as it did not address the real problem. However, that should be handled in a separate dedicated ticket, which I have now created as #168301. Feel welcome to pick up and resolve this ticket then, as you did all that was necessary :)
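For anyone picking this up, a quick way to double-check which devices back the array (md127 taken from the lsblk output above; plain standard tooling, nothing openQA specific):
lsblk -o NAME,SIZE,TYPE,MOUNTPOINTS   # which physical devices feed which md array
cat /proc/mdstat                      # member devices and RAID level of each array
sudo mdadm --detail /dev/md127        # per-member breakdown of the existing RAID0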