action #128417
closed
[alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M
Added by nicksinger over 1 year ago.
Updated over 1 year ago.
Description
Observation
On 2023-04-28 16:30 the partition usage of w5-xen skyrocketed to >90% (https://stats.openqa-monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?orgId=1&viewPanel=65090&from=1682657429086&to=1682699823248) and shortly afterwards an alert fired. Someone or something cleaned up a short time later, bringing usage back down to a reasonable 40%.
Suggestions
- DONE: Check with e.g. @okurz if this was maybe a one-time thing because somebody moved around stuff manually
- DONE: Manual cleanup of files in /var/lib/libvirt/images, ask in #eng-testing what the stuff is needed for
- Plug in more SSDs. Likely we have some spare in FC Basement shelves
- Check virsh XMLs to crosscheck openQA jobs before deleting anything for good (see the sketch after this list)
- Adjust the alert to allow longer periods over the threshold: we decided that our thresholds are feasible
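
A minimal sketch of the virsh crosscheck mentioned above, assuming all defined domains are reachable via the local libvirt connection and reference their disks under /var/lib/libvirt/images (nothing here is quoted from the ticket):

# list every image path referenced by any defined domain, so that only
# unreferenced files are candidates for deletion
for dom in $(virsh list --all --name); do
  virsh dumpxml "$dom" | grep -o "/var/lib/libvirt/images/[^']*"
done | sort -u

Anything in the pool that does not appear in this list is at least no longer referenced by a defined domain.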
- Subject changed from [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again - alert adjustement needed? to [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M
- Description updated (diff)
- Status changed from New to Workable
- Related to action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M added
- Priority changed from Normal to High
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Asked testers in https://suse.slack.com/archives/C02CANHLANP/p1683031962915279 if anybody still uses the old assets we found in /var/lib/libvirt/images. No answer yet, but the file owner might give away who created them:
-rw-r--r-- 1 coolo nogroup 445M 28. Apr 02:13 SLE-15-SP4-Online-x86_64-Build183.14-Media1.iso
-rw-r--r-- 1 coolo nogroup 445M 28. Apr 21:40 SLE-15-SP4-Online-x86_64-Build186.1-Media1.iso
-rw-r--r-- 1 root root 5,2G 24. Jul 2018 SLES-12-SP1-x86_64-xen-pv-svirt-allpatterns.qcow2_un
-rw-r--r-- 1 root root 805M 23. Oct 2018 SLES12-SP4-JeOS.x86_64-12.4-VMware-Build10.7.vmdk
-rw-r--r-- 1 coolo nogroup 286M 27. Sep 2021 SLES12-SP5-JeOS.x86_64-12.5-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 286M 12. May 2022 SLES12-SP5-JeOS.x86_64-12.5-XEN-GM.qcow2
-rw-r--r-- 1 coolo nogroup 1,8G 20. Sep 2019 SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw
-rw-r--r-- 1 coolo nogroup 335M 20. Sep 2019 SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw.xz
-rw-r--r-- 1 coolo nogroup 123 11. Sep 2020 SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256
-rw-r--r-- 1 coolo nogroup 481 11. Sep 2020 SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256.asc
-rw-r--r-- 1 coolo nogroup 234M 27. Sep 2021 SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-QU3.qcow2
-rw-r--r-- 1 coolo nogroup 239M 27. Sep 2021 SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 288M 21. Apr 01:33 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.28.qcow2
-rw-r--r-- 1 coolo nogroup 288M 28. Apr 01:28 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.30.qcow2
-rw-r--r-- 1 coolo nogroup 283M 12. May 2022 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-GM.qcow2
I moved these files to /root/poo128417_BACKUP now, freeing close to 10G.
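
For reference, a sketch of that manual move (the patterns below are only an illustration built from the listing above, not the exact command that was run):

mkdir -p /root/poo128417_BACKUP
# move the stale images/ISOs out of the libvirt pool instead of deleting them,
# so they can be restored if a tester still needs one of them
mv -v /var/lib/libvirt/images/SLE-15-SP4-Online-x86_64-Build18*.iso \
      /var/lib/libvirt/images/SLES*JeOS* \
      /var/lib/libvirt/images/SLES-12-SP1-x86_64-xen-pv-svirt-allpatterns.qcow2_un \
      /var/lib/libvirt/images/SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-* \
      /root/poo128417_BACKUP/
df -h /var/lib/libvirt/images   # confirm the freed space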
- Due date set to 2023-05-17
Setting due date based on mean cycle time of SUSE QE Tools
- Description updated (diff)
Found a generic cleanup script in root's crontab, located at /usr/local/bin/cleanup-openqa-assets:
#!/bin/sh -e
if [[ $(df | grep "/var/lib/libvirt/images" | awk '{ print $5 }' | tr -d '%\n') -gt 70 ]] ; then
find /var/lib/libvirt/images/*.{qcow2,iso,img,xml,qcow2.xz} -mtime +0 ! -exec fuser -s "{}" 2>/dev/null \; -exec rm -fv {} \;
fi
but I have no clue yet what creates these files in the first place.
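
One way to find out (just an idea, nothing like this is configured on the machine) would be a temporary audit watch on the directory:

# watch for writes/attribute changes in the image pool and tag them with a key
auditctl -w /var/lib/libvirt/images -p wa -k poo128417-writes
# once a new file has shown up, look up which executable created it
ausearch -k poo128417-writes -i | grep -E 'comm=|exe='
# remove the watch again when done
auditctl -W /var/lib/libvirt/images -p wa -k poo128417-writes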
- Assignee changed from nicksinger to okurz
- Assignee changed from okurz to nicksinger
The server case has 8 drive enclosures. I called

for i in sda sdb sdc ; do hdparm -t /dev/$i ; done

to identify which physical drives the OS detects. The populated enclosures are labeled 0, 1 and 2, so enclosure 3 in the lower row (bottom right) as well as all four enclosures in the top row seem to be empty. The empty slots only hold a beige plastic dummy spacer, so I can't mount the SSDs in them directly, but I can hackily plug the SSDs into the backplane without using an enclosure at all. I called

echo "- - -" | tee /sys/class/scsi_host/host*/scan

but lsblk did not show a new device. I plugged in four SSDs and now at least "sdd" shows up, but no more. The upper row does not seem to be directly usable; the bottom-right slot is "sdd" and I couldn't find a way to use drives in the upper row. lsscsi returns all four devices seemingly connected to the "first" (?) controller, and lshw -class storage says there are two controllers, but maybe they are effectively the same. Anyway, let's continue with just the one additional device.
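
Before wiping it, a few read-only sanity checks (hypothetical, not part of what was actually run on the host) could confirm that /dev/sdd really is the newly plugged, empty SSD:

lsblk -o NAME,SIZE,MODEL,SERIAL /dev/sdd   # size/model/serial of the new disk
smartctl -i /dev/sdd                       # assumes smartmontools is installed
blkid /dev/sdd || echo "no existing filesystem signature on /dev/sdd"

The commands that were then run to add it to the existing volume group: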
wipefs -a /dev/sdd                              # remove any leftover signatures from the disk
pvcreate /dev/sdd                               # initialize it as an LVM physical volume
vgextend openqa_vg /dev/sdd                     # add it to the existing volume group
lvextend -l +100%FREE /dev/openqa_vg/openqa_lv  # grow the LV over all free extents
resize2fs /dev/openqa_vg/openqa_lv              # grow the filesystem to the new LV size
So at least we now have:
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/openqa_vg-openqa_lv 331G  131G  199G  40% /var/lib/libvirt/images
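
To double-check the result of the extension, a few read-only LVM queries (not quoted from the ticket) would be:

pvs              # /dev/sdd should be listed as a PV belonging to openqa_vg
vgs openqa_vg    # VSize should have grown by the SSD's capacity, VFree close to 0
lvs openqa_vg    # openqa_lv should now span the enlarged volume group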
I will leave the additional SSDs next to openqaw5-xen for further use.
@nicksinger to continue with moving assets to different storage and providing a bind mount into a subfolder.
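
A sketch of what such a bind mount could look like (the source path /space/libvirt-assets is purely hypothetical):

mkdir -p /space/libvirt-assets /var/lib/libvirt/images/assets
mount --bind /space/libvirt-assets /var/lib/libvirt/images/assets
# to make it persistent, an /etc/fstab entry along these lines:
# /space/libvirt-assets  /var/lib/libvirt/images/assets  none  bind  0  0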
- Due date deleted (2023-05-17)
- Status changed from In Progress to Resolved
We looked into this together and decided it's too much effort for us right now to improve how assets are handled. Nevertheless, with the new space we now have about 100G more and current usage is at 32%, so there is plenty of headroom until the next alert hits us.