action #177159

[alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size: S

Added by mkittler 9 days ago. Updated 1 day ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-02-13
Due date: 2025-03-06 (Due in 12 days)
% Done: 0%
Estimated time:

Description

Observation

This was problematic in the past, see #173947. I had a brief look at s390zl12.oqa.prg2.suse.org but couldn't find much that I could easily remove.

This caused an alert because the disk usage stayed at 88 % for two hours, see https://monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=panel-65090&from=2025-02-13T08:30:31.456Z&to=2025-02-13T09:30:26.604Z&timezone=browser&var-datasource=000000001&refresh=1m. Since the disk usage is now back at 82 %, it isn't clear what caused the spike.

Acceptance Criteria

  • AC1: The disk usage is considerably below the 80% alert threshold

Suggestions

  • Use a bigger disk, which is possible because this is a virtual device; however, 40 GB should actually be enough for a special-purpose OS instance
  • Limit the space used by snapshots … if snapshots are actually the culprit
  • As this is about the root filesystem and we have a separate one for /var/lib/libvirt/images, it should certainly be feasible to stay well below 40 GB on the root filesystem. Just use btrfs fi du / $something or variants to find out where we lose the space and clean up (see the sketch after this list)
  • Re-run commands from #173947#note-8 again and try to make sense of the output.
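
A minimal sketch of that kind of inspection, assuming a standard btrfs/snapper setup (commands are generic, not taken from the host):

# overall allocation as reported by btrfs itself
btrfs filesystem usage /
# per-directory usage of the root filesystem only, biggest entries last
du -xsh /* 2>/dev/null | sort -h | tail
# btrfs-aware view that accounts for extents shared with snapshots/reflinks
btrfs filesystem du -s /* 2>/dev/null
# snapshot list and cleanup settings, if snapper is in use
snapper list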

Actions #1

Updated by mkittler 9 days ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by mkittler 9 days ago

  • Description updated (diff)
Actions #3

Updated by mkittler 9 days ago

  • Priority changed from Urgent to High
Actions #4

Updated by okurz 8 days ago

  • Tags changed from alert, reactive work to alert, reactive work, infra, s390x
  • Category set to Regressions/Crashes
Actions #5

Updated by robert.richardson 4 days ago

  • Subject changed from [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` to [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz 4 days ago

  • Priority changed from High to Normal
Actions #7

Updated by okurz 3 days ago

  • Priority changed from Normal to High
Actions #8

Updated by nicksinger 3 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #9

Updated by nicksinger 3 days ago

So an initial screening of the machine shows:

s390zl12:~ # mount -o subvolid=5 /dev/dasda2 /mnt/btrfs/
s390zl12:~ # du -sh /mnt/btrfs/*
89G /mnt/btrfs/@
s390zl12:~ # du -sh /mnt/btrfs/@/*
15M /mnt/btrfs/@/boot
4.0K    /mnt/btrfs/@/etc
2.4M    /mnt/btrfs/@/home
0   /mnt/btrfs/@/opt
184K    /mnt/btrfs/@/root
0   /mnt/btrfs/@/srv
4.1M    /mnt/btrfs/@/tmp
4.0K    /mnt/btrfs/@/usr
21G /mnt/btrfs/@/var

s390zl12:/mnt/btrfs/@/.snapshots # btrfs filesystem du -s *
     Total   Exclusive  Set shared  Filename
   5.44GiB    32.00KiB     5.44GiB  400
   5.54GiB   229.93MiB     5.32GiB  646
   5.45GiB   132.00KiB     5.45GiB  647
   5.45GiB     4.00KiB     5.45GiB  648
   5.53GiB   144.00KiB     5.53GiB  649
   5.53GiB    92.00KiB     5.53GiB  650
   5.53GiB     8.16MiB     5.52GiB  651
   5.53GiB   152.00KiB     5.53GiB  652
   5.53GiB       0.00B     5.53GiB  653
   5.44GiB   132.00KiB     5.44GiB  654
   5.44GiB    28.00KiB     5.44GiB  655
   5.44GiB    44.00KiB     5.44GiB  656
     0.00B       0.00B       0.00B  grub-snapshot.cfg

so the biggest snapshot only needs 229.93 MiB exclusively. However, the var subvolume looks rather big. Checking on the live system I can see:

s390zl12:/mnt/btrfs/@ # du -shx /var
4.9G    /var

Following these crumbs I find that /mnt/btrfs/@/var/lib/libvirt/images/ uses 17G. So we have libvirt images on the root disk that are supposed to reside on a separate disk/partition. I will clean them up and check whether I can improve the boot dependencies.
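
For context, a quick way to verify that these files live on the root filesystem rather than on the dedicated images filesystem (a generic sketch; only the paths are taken from this ticket):

# is a separate filesystem currently mounted at the live images path?
findmnt /var/lib/libvirt/images
# size of the copy sitting on the root filesystem, seen via the top-level subvolume mount
du -sh /mnt/btrfs/@/var/lib/libvirt/images
# size of whatever is visible at the live path (the separate filesystem, if mounted)
du -sh /var/lib/libvirt/images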

Actions #10

Updated by nicksinger 3 days ago · Edited

Just for completeness, a list of these old files:

s390zl12:/mnt/btrfs/@/var/lib/libvirt/images # ls -lah
total 17G
drwx--x--x 1 root root  752 Sep  3 12:20 .
drwxr-xr-x 1 root root   88 Jun 27  2024 ..
-rw-r--r-- 1 qemu qemu 5.1G Sep  3 12:24 openQA-SUT-12a.img
-rw-r--r-- 1 qemu qemu  56M Sep  3 12:03 openQA-SUT-12.initrd
-rw-r--r-- 1 qemu qemu 8.0M Sep  3 12:03 openQA-SUT-12.kernel
-rw-r--r-- 1 root root 1.5K Sep  3 12:03 openQA-SUT-12.xml
-rw-r--r-- 1 qemu qemu 2.7G Sep  3 12:11 openQA-SUT-14a.img
-rw-r--r-- 1 qemu qemu  48M Sep  3 12:03 openQA-SUT-14.initrd
-rw-r--r-- 1 qemu qemu 7.9M Sep  3 12:03 openQA-SUT-14.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-14.xml
-rw-r--r-- 1 root root 2.8G Sep  3 12:19 openQA-SUT-17a.img
-rw-r--r-- 1 root root  56M Sep  3 12:03 openQA-SUT-17.initrd
-rw-r--r-- 1 root root 8.0M Sep  3 12:03 openQA-SUT-17.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-17.xml
-rw-r--r-- 1 root root 2.7G Sep  3 12:19 openQA-SUT-18a.img
-rw-r--r-- 1 root root  48M Sep  3 12:03 openQA-SUT-18.initrd
-rw-r--r-- 1 root root 7.9M Sep  3 12:03 openQA-SUT-18.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-18.xml
-rw-r--r-- 1 root root 264M Sep  3 12:20 supp_sles15sp4_updatestack-s390x.qcow2
-rw-r--r-- 1 root root 2.6G Sep  3 12:20 supp_sles15sp5_updatestack-s390x.qcow2

All of this is auto-generated and rather old -> trash
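
A sketch of the cleanup, assuming the domain names match the XML file names above and none of them is still defined or running (the virsh output should confirm that first):

# check that no defined or running domain still references these images
virsh list --all
virsh domblklist openQA-SUT-12    # hypothetical domain name, repeat for the other SUTs
# then drop the stale files from the root filesystem copy
rm -v /mnt/btrfs/@/var/lib/libvirt/images/openQA-SUT-*.{img,initrd,kernel,xml}
rm -v /mnt/btrfs/@/var/lib/libvirt/images/supp_sles15sp*_updatestack-s390x.qcow2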

Actions #12

Updated by nicksinger 3 days ago

  • Status changed from Feedback to In Progress
  • Priority changed from High to Normal

My changes only order the unit and do not require the mount point to be present. A related discussion can be found in Slack. I'm looking into possible solutions.
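
For illustration, the difference could be expressed with a systemd drop-in along these lines (a hypothetical sketch, not the actual change; the drop-in path and directive are assumptions):

# a plain "After=var-lib-libvirt-images.mount" only orders libvirtd after the mount
# if that mount unit is started anyway; RequiresMountsFor= additionally pulls the
# mount in and keeps libvirtd from starting without it
mkdir -p /etc/systemd/system/libvirtd.service.d
cat <<'EOF' > /etc/systemd/system/libvirtd.service.d/images-mount.conf
[Unit]
RequiresMountsFor=/var/lib/libvirt/images
EOF
systemctl daemon-reload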

Actions #13

Updated by openqa_review 3 days ago

  • Due date set to 2025-03-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by nicksinger 1 day ago

  • Status changed from In Progress to Feedback

My MR now includes management of these storage partitions in /etc/fstab and a more complex interaction between the mount unit and libvirtd.service. To avoid automatically breaking both workers at the same time, I removed the entry from top.sls in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1380 and just introduce the state first. After this is merged, I can test with state.apply libvirt.storage on a single host and only roll it out everywhere once everything is working as expected.
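
The manual test on a single host could look roughly like this (a sketch; the minion target is an assumption):

# dry run of only the new state against one host
salt 's390zl12.oqa.prg2.suse.org' state.apply libvirt.storage test=True
# apply for real once the dry run looks sane
salt 's390zl12.oqa.prg2.suse.org' state.apply libvirt.storage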
