action #128417: [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #128417

closed

[alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M

Added by nicksinger over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:
Tags: infra

Description

Observation

On 2023-04-28 at 16:30 the partition usage of openqaw5-xen jumped to >90% (https://stats.openqa-monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?orgId=1&viewPanel=65090&from=1682657429086&to=1682699823248) and an alert fired shortly afterwards. Someone or something cleaned up a short time later, bringing usage back down to a reasonable 40%.

Suggestions

DONE: Check with e.g. @okurz if this was maybe a one-time thing because somebody moved around stuff manually
DONE: Manual cleanup of files in /var/lib/libvirt/images, ask in #eng-testing what the stuff is needed for
Plug in more SSDs. Likely we have some spare in FC Basement shelves
Check virsh XMLs to crosscheck openQA jobs before deleting anything for good
~~Adjust the alert to allow longer periods over the threshold~~ We decided that our thresholds are feasible

Related issues 1 (1 open — 0 closed)

Related to openQA Infrastructure (public) - action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M (New, 2023-04-24)

History

Updated by okurz over 1 year ago

Tags set to infra

Updated by nicksinger over 1 year ago

Subject changed from [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again - alert adjustement needed? to [alert][grafana] openqaw5-xen: partitions usage (%) alert fired and quickly after recovered again size:M
Description updated (diff)
Status changed from New to Workable

Updated by nicksinger over 1 year ago

Related to action #128222: [virtualization] The Xen specific host configuration on openqaw5-xen can be re-created from salt size:M added

Updated by okurz over 1 year ago

Priority changed from Normal to High

Updated by nicksinger over 1 year ago

Status changed from Workable to In Progress
Assignee set to nicksinger

Updated by nicksinger over 1 year ago

Asked testers in https://suse.slack.com/archives/C02CANHLANP/p1683031962915279 whether anybody still uses the old assets we found in /var/lib/libvirt/images. No answer yet, but the file owner might give away who created them:

-rw-r--r-- 1 coolo nogroup 445M 28. Apr 02:13 SLE-15-SP4-Online-x86_64-Build183.14-Media1.iso
-rw-r--r-- 1 coolo nogroup 445M 28. Apr 21:40 SLE-15-SP4-Online-x86_64-Build186.1-Media1.iso
-rw-r--r-- 1 root  root    5,2G 24. Jul 2018  SLES-12-SP1-x86_64-xen-pv-svirt-allpatterns.qcow2_un
-rw-r--r-- 1 root  root    805M 23. Okt 2018  SLES12-SP4-JeOS.x86_64-12.4-VMware-Build10.7.vmdk
-rw-r--r-- 1 coolo nogroup 286M 27. Sep 2021  SLES12-SP5-JeOS.x86_64-12.5-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 286M 12. Mai 2022  SLES12-SP5-JeOS.x86_64-12.5-XEN-GM.qcow2
-rw-r--r-- 1 coolo nogroup 1,8G 20. Sep 2019  SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw
-rw-r--r-- 1 coolo nogroup 335M 20. Sep 2019  SLES15-SP1-JeOS.aarch64-15.1-RaspberryPi-Build36.2.5.raw.xz
-rw-r--r-- 1 coolo nogroup  123 11. Sep 2020  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256
-rw-r--r-- 1 coolo nogroup  481 11. Sep 2020  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-Build15.36.qcow2.sha256.asc
-rw-r--r-- 1 coolo nogroup 234M 27. Sep 2021  SLES15-SP2-JeOS.x86_64-15.2-kvm-and-xen-QU3.qcow2
-rw-r--r-- 1 coolo nogroup 239M 27. Sep 2021  SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-GM.qcow2
-rw-r--r-- 1 coolo nogroup 288M 21. Apr 01:33 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.28.qcow2
-rw-r--r-- 1 coolo nogroup 288M 28. Apr 01:28 SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-Build3.9.30.qcow2
-rw-r--r-- 1 coolo nogroup 283M 12. Mai 2022  SLES15-SP4-Minimal-VM.x86_64-kvm-and-xen-GM.qcow2

I have now moved these files to /root/poo128417_BACKUP, freeing close to 10G.
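
For reference, moving files aside instead of deleting them could be sketched roughly as follows. This is a minimal sketch and not the exact commands that were run: the function name move_to_backup and its argument order are made up for illustration; only the two directories come from the comment above. It also assumes the backup directory lives on a different filesystem than /var/lib/libvirt/images, otherwise the move would not free any space on the images partition.

```shell
#!/bin/bash
# Hypothetical helper (name and arguments are assumptions for illustration).
# Moves the named files from a source directory into a backup directory
# so they can be restored later if somebody still needs them.
move_to_backup() {
    local src=$1 dest=$2
    shift 2
    mkdir -p "$dest"
    local f
    for f in "$@"; do
        # Skip names that do not exist so a typo does not abort the whole run.
        [ -e "$src/$f" ] || { echo "skipping missing file: $f" >&2; continue; }
        mv -v -- "$src/$f" "$dest/"
    done
}

# Example call, left commented out so that sourcing this file changes nothing:
# move_to_backup /var/lib/libvirt/images /root/poo128417_BACKUP \
#     SLES12-SP4-JeOS.x86_64-12.4-VMware-Build10.7.vmdk
```

Keeping the files in a backup directory means the decision can be reverted with a single mv if a tester turns out to still depend on one of the images.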

Updated by openqa_review over 1 year ago

Due date set to 2023-05-17

Setting due date based on mean cycle time of SUSE QE Tools

Updated by nicksinger over 1 year ago

Description updated (diff)

Updated by nicksinger over 1 year ago

Found a generic cleanup script in root's crontab, located at /usr/local/bin/cleanup-openqa-assets:

#!/bin/sh -e

if [[ $(df | grep "/var/lib/libvirt/images" | awk '{ print $5 }' | tr -d '%\n') -gt 70 ]] ; then
    find /var/lib/libvirt/images/*.{qcow2,iso,img,xml,qcow2.xz} -mtime +0 ! -exec fuser -s "{}" 2>/dev/null \; -exec rm -fv {} \;
fi

but I have no clue yet what creates these files in the first place.
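
As a side note, the script above declares /bin/sh but relies on [[ ... ]] and brace expansion, so it only behaves as intended where /bin/sh supports those, e.g. when it is bash. A more explicit variant could look like the sketch below. This is an illustration only, not what is deployed on openqaw5-xen: the function names usage_percent and cleanup_images are assumptions, and unlike the original glob it only looks at regular files directly inside the directory.

```shell
#!/bin/bash
# Hypothetical rework of the cleanup logic; function names and defaults are
# assumptions for illustration, not the script deployed on the host.

# Print the usage of the filesystem holding $1 as a bare integer percentage.
usage_percent() {
    # -P forces one line per filesystem, so field 5 of line 2 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Delete image files in $1 that are at least one day old and not held open
# by any process, but only if usage exceeds $2 percent (default 70).
cleanup_images() {
    local dir=$1 threshold=${2:-70}
    [ "$(usage_percent "$dir")" -gt "$threshold" ] || return 0
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.qcow2' -o -name '*.iso' -o -name '*.img' \
           -o -name '*.xml' -o -name '*.qcow2.xz' \) \
        -mtime +0 ! -exec fuser -s {} \; -exec rm -fv {} \; 2>/dev/null
}

# Example call, left commented out so that sourcing this file deletes nothing:
# cleanup_images /var/lib/libvirt/images 70
```

Using -name tests instead of shell globs avoids find complaining when no file with one of the extensions exists, which under sh -e would otherwise make the cron job exit with an error.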

#10

Updated by okurz over 1 year ago

Assignee changed from nicksinger to okurz

Adding more disks

#11

Updated by okurz over 1 year ago

Assignee changed from okurz to nicksinger

The server case has 8 enclosures for drives. I called for i in sda sdb sdc ; do hdparm -t /dev/$i; done to identify the existing physical drives that the OS detects. Those drives sit in the enclosures labeled 0, 1, 2. So enclosure 3 in the lower row, bottom right, as well as all four enclosures in the top row seem to be empty. The empty enclosures seem to hold some kind of beige plastic dummy spacer. I can't mount the SSDs in those directly, but I can hackily plug the SSDs into the backplane without using the enclosure at all.

I called echo "- - -" | tee /sys/class/scsi_host/host*/scan but lsblk did not show a new device. I plugged in four SSDs and now at least "sdd" shows up, but no more. The bottom right one is "sdd". The upper row does not seem to be directly usable and I couldn't find a way to use drives there. lsscsi returns all four devices, seemingly connected to the "first" (?) controller. lshw -class storage says there are two controllers, but maybe they are effectively the same. Anyway, let's continue with just the one additional device.

wipefs -a /dev/sdd
pvcreate /dev/sdd
vgextend openqa_vg /dev/sdd
lvextend -l +100%FREE /dev/openqa_vg/openqa_lv
resize2fs /dev/openqa_vg/openqa_lv

so at least we now have:

/dev/mapper/openqa_vg-openqa_lv                         331G  131G  199G  40% /var/lib/libvirt/images

I will leave the additional SSDs next to openqaw5-xen for further use.

@nicksinger to continue with moving assets to different storage and providing a bind mount into a subfolder.
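
In case more of the spare SSDs get added later, the volume extension steps from this comment could be wrapped as sketched below. This is only a sketch under assumptions: the helper names run_step and extend_lv_with_disk and the RUN variable are invented here, and the script prints the commands by default instead of executing them. The commands themselves are the ones listed above; resize2fs implies an ext2/3/4 filesystem on the logical volume.

```shell
#!/bin/bash
# Hypothetical wrapper (names are assumptions for illustration).
# Prints each step unless RUN=1 is set, so the plan can be reviewed first.
run_step() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Add disk $1 to volume group $2 and grow logical volume $3 over all free space.
extend_lv_with_disk() {
    local disk=$1 vg=$2 lv=$3
    run_step wipefs -a "$disk"              # destroys existing signatures on the disk
    run_step pvcreate "$disk"               # initialise the disk as an LVM physical volume
    run_step vgextend "$vg" "$disk"         # add it to the volume group
    run_step lvextend -l +100%FREE "/dev/$vg/$lv"
    run_step resize2fs "/dev/$vg/$lv"       # grow the ext filesystem to the new size
}

# Dry run for the disk and volume names used in this ticket:
# extend_lv_with_disk /dev/sdd openqa_vg openqa_lv
```

The dry-run default is deliberate because wipefs -a is destructive, so a mistyped device name shows up in the printed plan before anything is touched.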

#12

Updated by okurz over 1 year ago

Due date deleted (~~2023-05-17~~)
Status changed from In Progress to Resolved

We looked into this together and decided that improving how assets are handled is too much effort for us right now. Nevertheless, with the additional disk we now have +100G of space and current usage is at 32%, so there is plenty of headroom before the next alert hits us.
