action #160481
closed
backup-vm: partitions usage (%) alert & systemd services alert size:S
Added by tinita 7 months ago.
Updated 7 months ago.
Category:
Regressions/Crashes
Description
Observation
Fri, 17 May 2024 04:01:33 +0200
1 firing alert instance
[screenshot of the Grafana alert notification]
Grouped by: hostname=backup-vm
Firing on stats.openqa-monitor.qa.suse.de: backup-vm: partitions usage (%) alert
Values: A0=86.0003690373683
Labels: alertname = backup-vm: partitions usage (%) alert, grafana_folder
View alert: http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_backup-vm/view?orgId=1
Also, possibly related:
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1
Not related; that alert was about backup-qam (and not backup-vm, which this ticket is about):
Failed systemd services
2024-05-16 15:27:50 backup-qam check-for-kernel-crash, kdump-notify
Suggestions
- Check partition usage and which component contributes the most to space usage (see the shell sketch after this list)
- Check what happened that caused this short spike in usage
- Consider increasing the size of the virtually attached storage
- Consider tweaking our backup rules to either include less data or keep less retention: not useful, it was the root partition (but backups are on the separate partition /dev/vdb1).
- Or maybe don't do anything if this only happened once and, based on the monitoring data, is not likely to happen again
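A minimal shell sketch for the first suggestion, to be run on backup-vm itself; the exact mount points are assumptions, and staying on one filesystem deliberately excludes the backup partition /dev/vdb1:

```sh
# Overall usage per mounted filesystem, excluding NFS mounts
df -h -x nfs -x nfs4

# Largest directories on the root filesystem; -x does not cross
# filesystem boundaries, so /dev/vdb1 (backups) is not counted
sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20

# Interactive alternative
sudo ncdu -x /

# If the root filesystem is btrfs, check whether snapshots hold the space
sudo snapper list
```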
https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%22Y7B%22:%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22alias%22:%22$tag_device%20%28$tag_fstype%29%22,%22dsType%22:%22influxdb%22,%22function%22:%22mean%22,%22groupBy%22:%5B%7B%22interval%22:%22auto%22,%22params%22:%5B%22auto%22%5D,%22type%22:%22time%22%7D,%7B%22key%22:%22host%22,%22params%22:%5B%22tag%22%5D,%22type%22:%22tag%22%7D,%7B%22key%22:%22path%22,%22params%22:%5B%22tag%22%5D,%22type%22:%22tag%22%7D%5D,%22interval%22:%221m%22,%22intervalMs%22:1000,%22maxDataPoints%22:43200,%22measurement%22:%22disk_total%22,%22orderByTime%22:%22ASC%22,%22policy%22:%22default%22,%22query%22:%22SELECT%20mean%28%5C%22used_percent%5C%22%29%20AS%20%5C%22used_percent%5C%22%20FROM%20%5C%22disk%5C%22%20WHERE%20%28%5C%22host%5C%22%20%3D%20%27backup-vm%27%20AND%20fstype%20%21~%20%2F%5Enfs%2F%20AND%20fstype%20%21%3D%20%27udf%27%29%20AND%20$timeFilter%20GROUP%20BY%20time%28$interval%29,%20%5C%22device%5C%22,%20%5C%22fstype%5C%22%20fill%28null%29%22,%22rawQuery%22:true,%22resultFormat%22:%22time_series%22,%22select%22:%5B%5B%7B%22params%22:%5B%22value%22%5D,%22type%22:%22field%22%7D,%7B%22params%22:%5B%5D,%22type%22:%22mean%22%7D%5D%5D,%22tags%22:%5B%5D,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-2d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1 shows that
the used space went from 74 % to 86 % this morning.
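For readability, this is the InfluxQL query embedded (URL-encoded) in the explore link above; $timeFilter and $interval are Grafana template variables:

```sql
SELECT mean("used_percent") AS "used_percent"
FROM "disk"
WHERE ("host" = 'backup-vm' AND fstype !~ /^nfs/ AND fstype != 'udf') AND $timeFilter
GROUP BY time($interval), "device", "fstype" fill(null)
```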
- Subject changed from backup-vm: partitions usage (%) alert to backup-vm: partitions usage (%) alert & systemd services alert
- Tags set to infra, alert, reactive work
- Subject changed from backup-vm: partitions usage (%) alert & systemd services alert to backup-vm: partitions usage (%) alert & systemd services alert size:S
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to mkittler
Looks like an update was going on at the time. Probably some snapper cleanup "fixed" the problem later. Considering the file system reached only 86.2 %, there was no real problem. Maybe we should just bump the threshold to 90 %, because always expecting so much headroom seems a bit wasteful.
I freed up almost 500 MiB by uninstalling libLLVM7, libLLVM9, libLLVM11, libLLVM15, webkit2gtk-4_0-injected-bundles, webkit2gtk-4_1-injected-bundles, WebKitGTK-4.0-lang, gnome-online-accounts and gnome-online-accounts-lang. 500 MiB is not that much, of course. According to ncdu the root filesystem isn't really bloated, so I don't think there's much more to gain here. According to snapper list there's also not one big snapshot.
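A rough sketch of the cleanup steps described above; the zypper invocation illustrates the package removal and is not a verbatim transcript:

```sh
# Remove the unused packages (freed almost 500 MiB combined)
sudo zypper remove libLLVM7 libLLVM9 libLLVM11 libLLVM15 \
    webkit2gtk-4_0-injected-bundles webkit2gtk-4_1-injected-bundles \
    WebKitGTK-4.0-lang gnome-online-accounts gnome-online-accounts-lang

# Check where the remaining space goes, staying on the root filesystem
sudo ncdu -x /

# Confirm that no single snapshot is unusually large
sudo snapper list
```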
- Due date set to 2024-06-06
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
I extended the root filesystem (and all other required underlying layers) by 5 GiB. Now we're at 58.7 % utilization, which is good enough.
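For reference, a hedged sketch of what extending the root filesystem and its underlying layers can look like for a libvirt guest; the hypervisor command, device names (/dev/vda2) and filesystem types are assumptions, not details taken from this ticket:

```sh
# On the hypervisor: grow the virtual disk (hypothetical new total size)
virsh blockresize backup-vm vda 25G

# Inside the VM: grow the partition holding the root filesystem
sudo growpart /dev/vda 2

# Grow the filesystem itself, depending on its type
sudo btrfs filesystem resize max /   # if root is btrfs
# sudo resize2fs /dev/vda2           # if root is ext4
```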