action #160481 (closed)

backup-vm: partitions usage (%) alert & systemd services alert size:S

Added by tinita 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2024-05-17
Due date: 2024-06-06
% Done: 0%
Estimated time:

Description

Observation

Fri, 17 May 2024 04:01:33 +0200

1 firing alert instance

📁 GROUPED BY hostname=backup-vm

  🔥 1 firing instance

Firing [stats.openqa-monitor.qa.suse.de]
backup-vm: partitions usage (%) alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
A0=86.0003690373683 
Labels
alertname: backup-vm: partitions usage (%) alert
grafana_folder:

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_backup-vm/view?orgId=1

Also, possibly related:
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

Not related: this was about backup-qam, not backup-vm, which this ticket is about:

Failed systemd services
2024-05-16 15:27:50    backup-qam    check-for-kernel-crash, kdump-notify
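
For reference, failed units like the ones in the table above can be inspected directly on the affected host; this is a generic sketch (the unit names come from the dashboard entry for backup-qam, not from this ticket's backup-vm):

    systemctl --failed                                      # list all units currently in a failed state
    systemctl status check-for-kernel-crash kdump-notify    # show details for the two reported units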

Suggestions

  • Check partition usage and which component contributes the most space usage (see the sketch after this list)
  • Check what happened that we had this short high usage surge
  • Consider increasing the size of the virtually attached storage
  • Consider tweaking our backup rules to either include less data or keep a shorter retention. Not useful: it was the root partition that filled up (backups are on the separate partition /dev/vdb1).
  • Or maybe don't do anything if this only happened once and, based on investigation of the monitoring data, is not likely to happen again
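
A minimal sketch of how the first two suggestions could be checked on the host; these are generic commands, not taken from the ticket:

    df -h                                        # usage of all mounted partitions, including /dev/vdb1
    du -xsh /* 2>/dev/null | sort -rh | head     # largest top-level directories on the root filesystem
    snapper list                                 # check whether old snapshots hold significant space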
Actions #1

Updated by tinita 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 2 months ago

Actions #3

Updated by tinita 2 months ago

  • Subject changed from backup-vm: partitions usage (%) alert to backup-vm: partitions usage (%) alert & systemd services alert
  • Description updated (diff)
Actions #4

Updated by okurz about 2 months ago

  • Tags set to infra, alert, reactive work
Actions #5

Updated by livdywan about 2 months ago

  • Subject changed from backup-vm: partitions usage (%) alert & systemd services alert to backup-vm: partitions usage (%) alert & systemd services alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #8

Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #10

Updated by mkittler about 2 months ago

  • Description updated (diff)
Actions #11

Updated by mkittler about 2 months ago

Looks like an update was going on at the time. Probably some snapper cleanup "fixed" the problem later. Considering the file system only reached 86.2 %, there was no real problem. Maybe we should just bump the threshold to 90 % because always expecting so much headroom seems a bit wasteful.
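
For illustration, an ad-hoc check mirroring the proposed condition; the real alert is defined in Grafana, and the 90 % here is only the suggested threshold, not the current configuration:

    usage=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
    [ "$usage" -le 90 ] || echo "root filesystem at ${usage} %, above the 90 % threshold"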

I freed up almost 500 MiB by uninstalling libLLVM7, libLLVM9, libLLVM11, libLLVM15, webkit2gtk-4_0-injected-bundles, webkit2gtk-4_1-injected-bundles, WebKitGTK-4.0-lang, gnome-online-accounts and gnome-online-accounts-lang. 500 MiB is not that much, of course. According to ncdu the root filesystem isn't really bloated, so I don't think there's much more to gain here. According to snapper list there's also not one big snapshot.
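
Roughly the cleanup and inspection steps described above, as they could be run on the host (package and tool names as mentioned in the comment):

    zypper remove libLLVM7 libLLVM9 libLLVM11 libLLVM15 \
      webkit2gtk-4_0-injected-bundles webkit2gtk-4_1-injected-bundles \
      WebKitGTK-4.0-lang gnome-online-accounts gnome-online-accounts-lang
    ncdu -x /        # interactively inspect what occupies the root filesystem
    snapper list     # verify that no single snapshot is unusually large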

Actions #12

Updated by openqa_review about 2 months ago

  • Due date set to 2024-06-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Resolved

I extended the root filesystem (and all other required underlying layers) by 5 GiB. Now we're at 58.7 % utilization, which is good enough.
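
For context, a rough sketch of what such a resize typically involves; the concrete layers (libvirt image path, partition number, btrfs root on /dev/vda3) are assumptions and not taken from this ticket:

    # on the VM host (with the VM shut down): grow the backing disk image by 5 GiB
    qemu-img resize /var/lib/libvirt/images/backup-vm.qcow2 +5G
    # in the guest: grow the partition and the filesystem on top of it
    growpart /dev/vda 3
    btrfs filesystem resize max /
    df -h /          # verify the new utilization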
