action #128417

closed

[alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M

Added by nicksinger over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

On 2023-04-28 16:30 the partition usage of w5-xen skyrocketed to >90% (https://stats.openqa-monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?orgId=1&viewPanel=65090&from=1682657429086&to=1682699823248) and quickly afterwards an alert was fired. Someone or something cleaned up a short time later, bringing usage back down to a reasonable 40%.

Suggestions

  • DONE: Check with e.g. @okurz if this was maybe a one-time thing because somebody moved around stuff manually
  • DONE: Manual cleanup of files in /var/lib/libvirt/images, ask in #eng-testing what the stuff is needed for
  • Plug in more SSDs. Likely we have some spares in the FC Basement shelves
  • Check virsh XMLs to crosscheck openQA jobs before deleting anything for good (see the sketch after this list)
  • Adjust the alert to allow longer periods over the threshold: we decided that our thresholds are feasible
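A possible way to do the virsh crosscheck mentioned above (a rough sketch using standard libvirt tooling; review the output manually before deleting anything):

#!/bin/sh
# List every disk image referenced by any defined libvirt domain, then
# print the files in the image directory that no domain references.
referenced=$(for dom in $(virsh list --all --name); do
    [ -n "$dom" ] && virsh dumpxml "$dom" | grep -o "source file='[^']*'" | cut -d"'" -f2
done | sort -u)

for img in /var/lib/libvirt/images/*; do
    echo "$referenced" | grep -qxF "$img" || echo "unreferenced: $img"
done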

Related issues 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M (New, 2023-04-24)

Actions #1

Updated by okurz over 1 year ago

  • Tags set to infra
Actions #2

Updated by nicksinger over 1 year ago

  • Subject changed from [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again - alert adjustement needed? to [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by nicksinger over 1 year ago

  • Related to action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M added
Actions #4

Updated by okurz over 1 year ago

  • Priority changed from Normal to High
Actions #5

Updated by nicksinger over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #6

Updated by nicksinger over 1 year ago

Asked testers in https://suse.slack.com/archives/C02CANHLANP/p1683031962915279 if anybody still uses the old assets we found in /var/lib/libvirt/images. No answer yet, but the file owner shown in the listing might give away who created them:

-rw-r--r-- 1 coolo nogroup 445M 28. Apr 02:13 SLE-15-SP4-Online-x86_64-Build183.14-Media1.iso
-rw-r--r-- 1 coolo nogroup 445M 28. Apr 21:40 SLE-15-SP4-Online-x86_64-Build186.1-Media1.iso
-rw-r--r-- 1 root  root    5,2G 24. Jul 2018  SLES-12-SP1-x86_64-xen-pv-svirt-allpatterns.qcow2_un
-rw-r--r-- 1 root  root    805M 23. Okt 2018  SLES12-SP4-JeOS.x86_64-12.4-VMware-Build10.7.vmdk
-rw-r--r-- 1 coolo nogroup 286M 27. Sep 2021  SLES12-SP5-JeOS.x86_64-12.5-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 286M 12. Mai 2022  SLES12-SP5-JeOS.x86_64-12.5-XEN-GM.qcow2
-rw-r--r-- 1 coolo nogroup 1,8G 20. Sep 2019  SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw
-rw-r--r-- 1 coolo nogroup 335M 20. Sep 2019  SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw.xz
-rw-r--r-- 1 coolo nogroup  123 11. Sep 2020  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256
-rw-r--r-- 1 coolo nogroup  481 11. Sep 2020  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256.asc
-rw-r--r-- 1 coolo nogroup 234M 27. Sep 2021  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-QU3.qcow2
-rw-r--r-- 1 coolo nogroup 239M 27. Sep 2021  SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 288M 21. Apr 01:33 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.28.qcow2
-rw-r--r-- 1 coolo nogroup 288M 28. Apr 01:28 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.30.qcow2
-rw-r--r-- 1 coolo nogroup 283M 12. Mai 2022  SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-GM.qcow2

I have now moved these files to /root/poo128417_BACKUP, freeing close to 10G.
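For the record, a sketch of how such a move can be done while making sure nothing still uses the files (paths as mentioned above; the file-type list is just an example):

mkdir -p /root/poo128417_BACKUP
for f in /var/lib/libvirt/images/*.{iso,qcow2,vmdk,raw,raw.xz}; do
    # skip anything a process (e.g. a running VM) still has open
    fuser -s "$f" 2>/dev/null && { echo "in use, skipping: $f"; continue; }
    mv -v "$f" /root/poo128417_BACKUP/
done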

Actions #7

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by nicksinger over 1 year ago

  • Description updated (diff)
Actions #9

Updated by nicksinger over 1 year ago

Found a generic cleanup script in root's crontab, located at /usr/local/bin/cleanup-openqa-assets:

#!/bin/sh -e

if [[ $(df | grep "/var/lib/libvirt/images" | awk '{ print $5 }' | tr -d '%\n') -gt 70 ]] ; then
    find /var/lib/libvirt/images/*.{qcow2,iso,img,xml,qcow2.xz} -mtime +0 ! -exec fuser -s "{}" 2>/dev/null \; -exec rm -fv {} \;
fi

But I have no clue yet what creates these files in the first place.
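One way to find out would be to put an audit watch on the directory and check later what wrote to it (a sketch, assuming auditd is available on the host; the key name is made up):

# watch for writes and attribute changes below the image directory
auditctl -w /var/lib/libvirt/images -p wa -k poo128417-images
# after new files have appeared, look up which user/executable created them
ausearch -k poo128417-images -i | grep -E 'comm=|exe=|uid=' | tail -n 20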

Actions #10

Updated by okurz over 1 year ago

  • Assignee changed from nicksinger to okurz

Adding more disks

Actions #11

Updated by okurz over 1 year ago

  • Assignee changed from okurz to nicksinger

The server case has 8 enclosures for drives. I called for i in sda sdb sdc ; do hdparm -t /dev/$i; done to identify the existing physical drives that the OS detects. Those enclosures are labeled 0, 1, 2, so enclosure 3 in the lower row (bottom right) as well as all four enclosures in the top row seem to be empty. The empty enclosures have some kind of beige plastic dummy spacer; I can't mount the SSDs in them directly, but I can hackily plug the SSDs into the backplane without using an enclosure at all.

I called echo "- - -" | tee /sys/class/scsi_host/host*/scan but lsblk did not show a new device. I plugged in four SSDs and now at least "sdd" shows up, but not more. The upper row seems to not be directly usable; the bottom-right slot is "sdd". I couldn't find a way to use drives in the upper row. lsscsi returns all four devices seemingly connected to the "first" (?) controller, and lshw -class storage says there are two controllers, but maybe they are effectively the same. Anyway, let's continue with just the one additional device.
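For reference, the discovery steps from above condensed into one place (device names and the wildcard over the SCSI hosts are specific to this machine):

# identify the drives the OS already sees
for i in sda sdb sdc; do hdparm -t /dev/$i; done
# rescan all SCSI hosts after hot-plugging the SSDs
echo "- - -" | tee /sys/class/scsi_host/host*/scan
# check what shows up and how the controllers are wired
lsblk
lsscsi
lshw -class storage

With the new /dev/sdd visible I extended the existing LVM setup: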

wipefs -a /dev/sdd                                 # clear any old filesystem/RAID signatures
pvcreate /dev/sdd                                  # initialize the new disk as an LVM physical volume
vgextend openqa_vg /dev/sdd                        # add it to the existing volume group
lvextend -l +100%FREE /dev/openqa_vg/openqa_lv     # grow the logical volume into all free space
resize2fs /dev/openqa_vg/openqa_lv                 # grow the ext filesystem to match

so at least we now have:

/dev/mapper/openqa_vg-openqa_lv                         331G  131G  199G  40% /var/lib/libvirt/images

I will leave the additional SSDs next to openqaw5-xen for further use.

@nicksinger to continue with moving assets to a different storage and providing a bind mount into a subfolder.
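A minimal sketch of what that could look like, assuming a hypothetical target /storage/libvirt-assets and a subfolder "assets" (both names are illustrative, nothing is decided yet):

# put the previously backed-up assets on the other storage
mkdir -p /storage/libvirt-assets
mv /root/poo128417_BACKUP/* /storage/libvirt-assets/
# make them reachable under the expected subfolder via a bind mount and persist it
mkdir -p /var/lib/libvirt/images/assets
mount --bind /storage/libvirt-assets /var/lib/libvirt/images/assets
echo '/storage/libvirt-assets /var/lib/libvirt/images/assets none bind 0 0' >> /etc/fstab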

Actions #12

Updated by okurz over 1 year ago

  • Due date deleted (2023-05-17)
  • Status changed from In Progress to Resolved

We looked into this together and decided it's too much effort for us right now to improve how assets are handled. Nevertheless, with the new space we have an additional 100G and current usage is at 32%, so there is plenty of headroom until the next alert hits us.
