action #177159 (closed)

[alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size:S

Added by mkittler 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-02-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

This was problematic in the past, see #173947. I had a brief look at s390zl12.oqa.prg2.suse.org but couldn't find much I could easily remove.

This caused an alert when the disk usage stayed at 88 % for two hours, see https://monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=panel-65090&from=2025-02-13T08:30:31.456Z&to=2025-02-13T09:30:26.604Z&timezone=browser&var-datasource=000000001&refresh=1m. Since disk usage is now back at 82 %, it isn't clear what caused the spike.

Acceptance Criteria

  • AC1: The disk usage is considerably below the 80% alert threshold

Suggestions

  • Use a bigger disk, which is possible because we have a virtual device, but 40 GB should actually be enough for a special-purpose OS instance
  • Limit space used by snapshots … if snapshots actually are the culprit
  • As this is about the root filesystem and we have a separate one for /var/lib/libvirt/images, it should certainly be feasible to get well below 40 GB on the root f/s. Just use btrfs fi du / $something or variants to find out where we lose the space and clean up; see the sketch after this list
  • Re-run the commands from #173947#note-8 and try to make sense of the output.
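
A minimal sketch of how the space usage could be broken down, assuming standard btrfs-progs and coreutils; the exact paths and options are illustrative, not what was actually run:

s390zl12:~ # btrfs filesystem usage /                      # overall allocation vs. actual data usage
s390zl12:~ # btrfs filesystem du -s /.snapshots/*          # exclusive space held by each snapshot
s390zl12:~ # du -h -d 2 -x / | sort -h | tail -n 20        # largest directories (-x stays on the root subvolume)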

Actions #1

Updated by mkittler 4 months ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by mkittler 4 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 4 months ago

  • Priority changed from Urgent to High
Actions #4

Updated by okurz 4 months ago

  • Tags changed from alert, reactive work to alert, reactive work, infra, s390x
  • Category set to Regressions/Crashes
Actions #5

Updated by robert.richardson 3 months ago

  • Subject changed from [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` to [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #7

Updated by okurz 3 months ago

  • Priority changed from Normal to High
Actions #8

Updated by nicksinger 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #9

Updated by nicksinger 3 months ago

So an initial screening of the machine shows:

s390zl12:~ # mount -o subvolid=5 /dev/dasda2 /mnt/btrfs/
s390zl12:~ # du -sh /mnt/btrfs/*
89G	/mnt/btrfs/@
s390zl12:~ # du -sh /mnt/btrfs/@/*
15M	/mnt/btrfs/@/boot
4.0K	/mnt/btrfs/@/etc
2.4M	/mnt/btrfs/@/home
0	/mnt/btrfs/@/opt
184K	/mnt/btrfs/@/root
0	/mnt/btrfs/@/srv
4.1M	/mnt/btrfs/@/tmp
4.0K	/mnt/btrfs/@/usr
21G	/mnt/btrfs/@/var

s390zl12:/mnt/btrfs/@/.snapshots # btrfs filesystem du -s *
     Total   Exclusive  Set shared  Filename
   5.44GiB    32.00KiB     5.44GiB  400
   5.54GiB   229.93MiB     5.32GiB  646
   5.45GiB   132.00KiB     5.45GiB  647
   5.45GiB     4.00KiB     5.45GiB  648
   5.53GiB   144.00KiB     5.53GiB  649
   5.53GiB    92.00KiB     5.53GiB  650
   5.53GiB     8.16MiB     5.52GiB  651
   5.53GiB   152.00KiB     5.53GiB  652
   5.53GiB       0.00B     5.53GiB  653
   5.44GiB   132.00KiB     5.44GiB  654
   5.44GiB    28.00KiB     5.44GiB  655
   5.44GiB    44.00KiB     5.44GiB  656
     0.00B       0.00B       0.00B  grub-snapshot.cfg

So the biggest snapshot only holds 229.93 MiB exclusively. However, the var subvolume looks rather big. Checking on the live system I can see:

s390zl12:/mnt/btrfs/@ # du -shx /var
4.9G	/var

Following these crumbs, I find that /mnt/btrfs/@/var/lib/libvirt/images/ uses 17G. So we have libvirt images on the root disk that are supposed to reside on a separate disk/partition. I will clean them up and check if I can improve the boot dependencies.
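
To confirm on the live system that the dedicated filesystem actually backs this path (and that the 17G only lingers on the offline-mounted root subvolume underneath it), something like the following could be used; a quick sketch:

s390zl12:~ # findmnt --target /var/lib/libvirt/images    # which filesystem currently serves the images path
s390zl12:~ # df -h / /var/lib/libvirt/images             # compare usage of the root f/s vs. the images f/s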

Actions #10

Updated by nicksinger 3 months ago · Edited

Just for completeness, a list of these old files:

s390zl12:/mnt/btrfs/@/var/lib/libvirt/images # ls -lah
total 17G
drwx--x--x 1 root root  752 Sep  3 12:20 .
drwxr-xr-x 1 root root   88 Jun 27  2024 ..
-rw-r--r-- 1 qemu qemu 5.1G Sep  3 12:24 openQA-SUT-12a.img
-rw-r--r-- 1 qemu qemu  56M Sep  3 12:03 openQA-SUT-12.initrd
-rw-r--r-- 1 qemu qemu 8.0M Sep  3 12:03 openQA-SUT-12.kernel
-rw-r--r-- 1 root root 1.5K Sep  3 12:03 openQA-SUT-12.xml
-rw-r--r-- 1 qemu qemu 2.7G Sep  3 12:11 openQA-SUT-14a.img
-rw-r--r-- 1 qemu qemu  48M Sep  3 12:03 openQA-SUT-14.initrd
-rw-r--r-- 1 qemu qemu 7.9M Sep  3 12:03 openQA-SUT-14.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-14.xml
-rw-r--r-- 1 root root 2.8G Sep  3 12:19 openQA-SUT-17a.img
-rw-r--r-- 1 root root  56M Sep  3 12:03 openQA-SUT-17.initrd
-rw-r--r-- 1 root root 8.0M Sep  3 12:03 openQA-SUT-17.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-17.xml
-rw-r--r-- 1 root root 2.7G Sep  3 12:19 openQA-SUT-18a.img
-rw-r--r-- 1 root root  48M Sep  3 12:03 openQA-SUT-18.initrd
-rw-r--r-- 1 root root 7.9M Sep  3 12:03 openQA-SUT-18.kernel
-rw-r--r-- 1 root root 1.6K Sep  3 12:03 openQA-SUT-18.xml
-rw-r--r-- 1 root root 264M Sep  3 12:20 supp_sles15sp4_updatestack-s390x.qcow2
-rw-r--r-- 1 root root 2.6G Sep  3 12:20 supp_sles15sp5_updatestack-s390x.qcow2

All of this is auto-generated and rather old -> trash
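
A sketch of what the cleanup could look like, assuming none of the listed files is still referenced by a defined domain:

s390zl12:~ # virsh list --all                      # double-check that no domain still uses these images
s390zl12:/mnt/btrfs/@/var/lib/libvirt/images # rm -v openQA-SUT-1[2478]* supp_sles15sp[45]_updatestack-s390x.qcow2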

Actions #12

Updated by nicksinger 3 months ago

  • Status changed from Feedback to In Progress
  • Priority changed from High to Normal

My changes only order the unit and do not require the mount point to be present. A related discussion can be found in Slack. I'm looking into possible solutions.
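
For illustration, a hypothetical drop-in like the one below (file name and exact approach are assumptions, not the actual change) would both order libvirtd after the images mount and require it, whereas a plain After= would only order the units:

s390zl12:~ # cat /etc/systemd/system/libvirtd.service.d/images-mount.conf
[Unit]
# RequiresMountsFor= adds both Requires= and After= on the mount unit for the
# given path, so libvirtd only starts once /var/lib/libvirt/images is mounted.
RequiresMountsFor=/var/lib/libvirt/images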

Actions #13

Updated by openqa_review 3 months ago

  • Due date set to 2025-03-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Feedback

My MR now includes management of these storage partitions in /etc/fstab and a more complex interaction between the mount unit and libvirtd.service. To avoid automatically breaking both workers at the same time, I removed the entry from top.sls in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1380 and just introduce the state first. After this is merged, I can test with state.apply libvirt.storage on a single host and only roll it out everywhere once everything works as expected.
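
A sketch of how such a targeted test could look from the salt master once the MR is merged; the prompt and minion ID are assumptions:

osd:~ # salt 's390zl12.oqa.prg2.suse.org' state.apply libvirt.storage test=True   # dry run: show pending changes only
osd:~ # salt 's390zl12.oqa.prg2.suse.org' state.apply libvirt.storage             # actually apply, on this host only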

Actions #15

Updated by nicksinger 3 months ago

The initial MR is tested and deployed on zl12+13; fixups and enabling it permanently: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1387.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1386 is a side product of this.

Actions #16

Updated by livdywan 3 months ago

How are we doing here at this point? I guess we still need to cover more machines? Asking since this is due this week.

Actions #17

Updated by nicksinger 3 months ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-16:

How are we doing here at this point? I guess we still need to cover more machines? Asking since this is due this week.

We're good. I wanted to wait for possible follow-ups (and unfortunately got some). Further work on improving this will happen in #178015.

Actions #18

Updated by okurz 3 months ago

  • Subject changed from [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size: S to [alert] Disk `/dev/dasda2` (the btrfs root filesystem) is quite full (over 80 %) on `s390zl12.oqa.prg2.suse.org` size:S
  • Due date deleted (2025-03-06)
  • Status changed from Resolved to Workable

We found that s390zl12+13 have unaccepted salt keys, potentially related to this ticket, although taking those machines out of production was never mentioned anywhere. Maybe somebody else on alert duty did that.
@nicksinger please accept the salt keys, apply a high state, monitor and resolve at your convenience.
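
For reference, the requested steps would roughly look like this on the salt master (a sketch; the exact key names are whatever salt-key reports):

osd:~ # salt-key -l unaccepted                                   # list pending keys
osd:~ # salt-key -a 's390zl12.oqa.prg2.suse.org'                 # accept the key (repeat for s390zl13)
osd:~ # salt 's390zl1[23].oqa.prg2.suse.org' state.highstate     # apply a high state to both machines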

Actions #19

Updated by nicksinger 3 months ago

  • Status changed from Workable to In Progress

The machines were added again and a highstate applied cleanly. Checking some instances and jobs on OSD now.
