action #160481
closed
backup-vm: partitions usage (%) alert & systemd services alert size:S
Added by tinita 12 months ago.
Updated 11 months ago.
Category:
Regressions/Crashes
Description
Observation¶
Fri, 17 May 2024 04:01:33 +0200: 1 firing alert instance
Grouped by: hostname=backup-vm
Firing [stats.openqa-monitor.qa.suse.de]: backup-vm: partitions usage (%) alert
Values: A0=86.0003690373683
Alert rule: http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_backup-vm/view?orgId=1
Also, possibly related:
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1
Not related, this was about backup-qam (and not backup-vm, which this ticket is about):
Failed systemd services
2024-05-16 15:27:50 backup-qam check-for-kernel-crash, kdump-notify
Suggestions¶
- Check partition usage and which component contributes the most space usage
- Check what happened that we had this short high usage surge
- Consider increasing the size of the virtually attached storage
- Consider tweaking our backup rules to either include less data or keep a shorter retention. Not useful: it was the root partition that filled up (backups are on the separate partition /dev/vdb1).
- Or maybe don't do anything if this only happened once and is not likely to happen again based on monitoring data investigation
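The first two suggestions can be sketched as a quick triage on the affected host (a generic sketch; the exact commands are not from the ticket):

```shell
# Which filesystems are close to full?
df -h
# Which top-level directories on the root filesystem contribute the most?
# (-x keeps du on one filesystem, so the separate backup partition is excluded)
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 10
```

For interactive drill-down, `ncdu -x /` gives the same per-directory breakdown, which is what was used later in this ticket.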
- Description updated (diff)
https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%22Y7B%22:%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22alias%22:%22$tag_device%20%28$tag_fstype%29%22,%22dsType%22:%22influxdb%22,%22function%22:%22mean%22,%22groupBy%22:%5B%7B%22interval%22:%22auto%22,%22params%22:%5B%22auto%22%5D,%22type%22:%22time%22%7D,%7B%22key%22:%22host%22,%22params%22:%5B%22tag%22%5D,%22type%22:%22tag%22%7D,%7B%22key%22:%22path%22,%22params%22:%5B%22tag%22%5D,%22type%22:%22tag%22%7D%5D,%22interval%22:%221m%22,%22intervalMs%22:1000,%22maxDataPoints%22:43200,%22measurement%22:%22disk_total%22,%22orderByTime%22:%22ASC%22,%22policy%22:%22default%22,%22query%22:%22SELECT%20mean%28%5C%22used_percent%5C%22%29%20AS%20%5C%22used_percent%5C%22%20FROM%20%5C%22disk%5C%22%20WHERE%20%28%5C%22host%5C%22%20%3D%20%27backup-vm%27%20AND%20fstype%20%21~%20%2F%5Enfs%2F%20AND%20fstype%20%21%3D%20%27udf%27%29%20AND%20$timeFilter%20GROUP%20BY%20time%28$interval%29,%20%5C%22device%5C%22,%20%5C%22fstype%5C%22%20fill%28null%29%22,%22rawQuery%22:true,%22resultFormat%22:%22time_series%22,%22select%22:%5B%5B%7B%22params%22:%5B%22value%22%5D,%22type%22:%22field%22%7D,%7B%22params%22:%5B%5D,%22type%22:%22mean%22%7D%5D%5D,%22tags%22:%5B%5D,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-2d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1 shows that
the used space on the root filesystem went from 74 % to 86 % this morning
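For readability, the raw InfluxQL query URL-encoded in the explore link above decodes to roughly the following (Grafana's $timeFilter and $interval placeholders left as-is):

```sql
SELECT mean("used_percent") AS "used_percent"
FROM "disk"
WHERE ("host" = 'backup-vm' AND fstype !~ /^nfs/ AND fstype != 'udf')
  AND $timeFilter
GROUP BY time($interval), "device", "fstype" fill(null)
```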
- Subject changed from backup-vm: partitions usage (%) alert to backup-vm: partitions usage (%) alert & systemd services alert
- Tags set to infra, alert, reactive work
- Subject changed from backup-vm: partitions usage (%) alert & systemd services alert to backup-vm: partitions usage (%) alert & systemd services alert size:S
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to mkittler
Looks like an update was going on at the time. Probably some snapper cleanup "fixed" the problem later. Considering the filesystem reached only 86.2 %, there was no real problem. Maybe we should just bump the threshold to 90 %, because always expecting this much headroom seems a bit wasteful.
I freed up almost 500 MiB by uninstalling libLLVM7, libLLVM9, libLLVM11, libLLVM15, webkit2gtk-4_0-injected-bundles, webkit2gtk-4_1-injected-bundles, WebKitGTK-4.0-lang, gnome-online-accounts and gnome-online-accounts-lang. 500 MiB is not that much, of course. According to ncdu the root filesystem isn't really bloated, so I don't think there's much more to gain here. According to snapper list there's also not one big snapshot.
- Due date set to 2024-06-06
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
I extended the root filesystem (and all other required underlying layers) by 5 GiB. Now we're at 58.7 % utilization, which is good enough.
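The resolution above can be sketched as follows; device names, the libvirt domain name and the target size are assumptions, since the ticket only states that the root filesystem and its underlying layers were grown by 5 GiB:

```shell
# Hypothetical resize steps for a KVM guest (all names/sizes are examples):
# On the virtualization host (virsh blockresize takes an absolute new size):
#   virsh blockresize backup-vm vda 30G
# Inside the VM:
#   growpart /dev/vda 2                  # grow the partition table entry
#   btrfs filesystem resize max /        # grow a btrfs root fs (resize2fs for ext4)
# Verify the new utilization afterwards:
df -h /
```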