action #162725
Status: closed
Parent: coordination #162716: [epic] Better use of storage on OSD workers
After w40 reconsider storage use for other OSD workers size:S
Description
Motivation
See #162719
Acceptance criteria
- AC1: All PRG2 x86_64 OSD workers have significantly more space than 500G for pool+cache combined using the existing physical storage devices
Acceptance tests
- AT1-1: ssh osd sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' cmd.run 'df -h /var/lib/openqa' shows > 500G for all PRG2 workers
Suggestions
- Review what was done in #162719-15 manually and consider changing the mount points using the salt states in https://gitlab.suse.de/openqa/salt-states-openqa/-/tree/master/openqa/nvme_store?ref_type=heads that prepare the devices accordingly. If that is not feasible, apply the same changes manually on all machines (see the command sketch below this list)
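A minimal sketch of how such a salt state could be applied from OSD, following the targeting used in AT1-1. The state name openqa.nvme_store and the dry-run-first approach are assumptions based on the repository path above, not verified against the actual states:
# dry run first to see which changes the nvme_store state would apply (state name is an assumption)
ssh osd sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' state.apply openqa.nvme_store test=True
# apply for real once the dry-run output looks sane
ssh osd sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' state.apply openqa.nvme_store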
Out of scope
Ordering any new physical storage devices
Updated by okurz 11 months ago
- Copied from action #162719: Ensure w40 has more space for worker pool directories size:S added
Updated by gpathak 8 months ago
Seems like the AC and AT are already fulfilled:
openqaworker17.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 617G 2.7T 19% /var/lib/openqa
openqaworker18.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 618G 2.7T 19% /var/lib/openqa
openqaworker16.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 3.5T 611G 2.7T 19% /var/lib/openqa
worker36.oqa.prg2.suse.org:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 57G 835G 7% /var/lib/openqa
worker39.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 55G 837G 7% /var/lib/openqa
worker32.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 55G 836G 7% /var/lib/openqa
worker30.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 62G 830G 7% /var/lib/openqa
qesapworker-prg5.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 65G 14T 1% /var/lib/openqa
worker33.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 62G 830G 7% /var/lib/openqa
worker31.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 65G 827G 8% /var/lib/openqa
worker29.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 79G 813G 9% /var/lib/openqa
qesapworker-prg4.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 66G 14T 1% /var/lib/openqa
openqaworker14.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 2.5T 582G 1.8T 25% /var/lib/openqa
qesapworker-prg7.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 66G 14T 1% /var/lib/openqa
worker40.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 6.3T 55G 5.9T 1% /var/lib/openqa
worker35.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 59G 832G 7% /var/lib/openqa
worker34.oqa.prg2.suse.org: <--
Filesystem Size Used Avail Use% Mounted on
/dev/md127 939G 61G 830G 7% /var/lib/openqa
qesapworker-prg6.qa.suse.cz:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 59G 14T 1% /var/lib/openqa
sapworker1.qe.nue2.suse.org:
Filesystem Size Used Avail Use% Mounted on
/dev/md127 14T 62G 14T 1% /var/lib/openqa
The above output is from the command sudo salt --no-color -C 'G@roles:worker and G@osarch:x86_64' cmd.run 'df -h /var/lib/openqa' executed on OSD.
Updated by okurz 8 months ago
- Copied to action #168301: After w40 related problems reconsider storage use for all PRG2 based OSD workers added
Updated by okurz 8 months ago
Good check! Yeah, that's true. It seems what was overlooked as part of #162719 is that w40 probably temporarily lost the connection to one of its NVMes. Right now lsblk shows
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme1n1 259:0 0 476.9G 0 disk
└─md127 9:127 0 6.3T 0 raid0 /var/lib/openqa
nvme2n1 259:1 0 476.9G 0 disk
├─nvme2n1p1 259:2 0 512M 0 part /boot/efi
├─nvme2n1p2 259:3 0 293G 0 part /
…
nvme0n1 259:6 0 5.8T 0 disk
└─md127 9:127 0 6.3T 0 raid0 /var/lib/openqa
so a RAID0 is constructed from a 500GiB and a 6TiB device, which does not make much sense to me. So after all I think the approach from dheidler in #162719 was not enough, as it did not address the real problem. However, we should handle that in a separate dedicated ticket, which I have now created as #168301. Feel welcome to pick up and resolve this ticket then, as you did everything that was necessary :)
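For reference, a minimal read-only sketch of how the RAID composition on w40 could be inspected before deciding on a fix in #168301. The device and array names are taken from the lsblk output above; the smartctl check assumes smartmontools is installed and is only illustrative:
# show member devices, RAID level and sizes of the array backing /var/lib/openqa
sudo mdadm --detail /dev/md127
# cross-check which NVMes the array and the root/EFI partitions live on
sudo lsblk -o NAME,SIZE,TYPE,MOUNTPOINTS
# hypothetical: check whether the small NVMe that was temporarily lost reports health problems
sudo smartctl -H /dev/nvme1n1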