action #160481

closed

backup-vm: partitions usage (%) alert & systemd services alert size:S

Added by tinita 12 months ago. Updated 12 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2024-05-17
Due date: 2024-06-06
% Done: 0%
Estimated time:

Description

Observation

Fri, 17 May 2024 04:01:33 +0200

1 firing alert instance, grouped by hostname=backup-vm:

Firing [stats.openqa-monitor.qa.suse.de]
backup-vm: partitions usage (%) alert
View alert [stats.openqa-monitor.qa.suse.de]
Values: A0=86.0003690373683
Labels:
  alertname: backup-vm: partitions usage (%) alert
  grafana_folder:

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_backup-vm/view?orgId=1

Also, possibly related:
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

Not related: the following was about backup-qam, not backup-vm, which this ticket is about:

Failed systemd services
2024-05-16 15:27:50    backup-qam    check-for-kernel-crash, kdump-notify

Suggestions

  • Check partition usage and which component contributes the most space usage
  • Check what happened that we had this short high usage surge
  • Consider increasing the size of the virtually attached storage
  • Consider tweaking our backup rules to include less data or use shorter retention. (Not useful: the alert concerned the root partition, while backups live on the separate partition /dev/vdb1.)
  • Or maybe don't do anything if this only happened once and is not likely to happen again based on monitoring data investigation
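The first two suggestions can be sketched as a minimal shell check; the 86 % figure mirrors the alert value above, and the GNU coreutils flags are an assumption about the tooling available on backup-vm:

```shell
# Report root filesystem usage and flag it against the level that fired here (86 %).
usage=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
echo "root usage: ${usage}%"
if [ "$usage" -ge 86 ]; then
    echo "at or above the level that fired the alert"
fi

# Largest top-level directories on the root filesystem; -x stays on one
# filesystem, so /dev/vdb1 (the separate backup partition) is not counted.
du -x --max-depth=1 -h / 2>/dev/null | sort -rh | head -n 10 || true
```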
Actions #1

Updated by tinita 12 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 12 months ago

Actions #3

Updated by tinita 12 months ago

  • Subject changed from backup-vm: partitions usage (%) alert to backup-vm: partitions usage (%) alert & systemd services alert
  • Description updated (diff)
Actions #4

Updated by okurz 12 months ago

  • Tags set to infra, alert, reactive work
Actions #5

Updated by livdywan 12 months ago

  • Subject changed from backup-vm: partitions usage (%) alert & systemd services alert to backup-vm: partitions usage (%) alert & systemd services alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan 12 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan 12 months ago

  • Description updated (diff)
Actions #8

Updated by mkittler 12 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #10

Updated by mkittler 12 months ago

  • Description updated (diff)
Actions #11

Updated by mkittler 12 months ago

Looks like an update was going on at the time. Probably some snapper cleanup "fixed" the problem later. Considering the filesystem reached only 86.2 %, there was no real problem. Maybe we should just bump the threshold to 90 %, because always expecting that much headroom seems a bit wasteful.

I freed up almost 500 MiB by uninstalling libLLVM7, libLLVM9, libLLVM11, libLLVM15, webkit2gtk-4_0-injected-bundles, webkit2gtk-4_1-injected-bundles, WebKitGTK-4.0-lang, gnome-online-accounts and gnome-online-accounts-lang. 500 MiB is not that much, of course. According to ncdu the root filesystem isn't really bloated, so I don't think there's much more to gain here. According to snapper list there is also no single big snapshot.
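The cleanup above could be reproduced roughly as follows; the package names come from the comment, but the rpm query and the commented SUSE-specific commands are assumptions, not the exact commands that were run:

```shell
# Package names taken from the comment (a subset shown); everything else here
# is a hypothetical reconstruction.
pkgs="libLLVM7 libLLVM9 libLLVM11 libLLVM15 gnome-online-accounts"

# Installed size per package in bytes, if rpm is available on this system.
if command -v rpm >/dev/null 2>&1; then
    rpm -q --queryformat '%{NAME} %{SIZE}\n' $pkgs 2>/dev/null || true
fi

# sudo zypper rm $pkgs   # actual removal (also resolves dependent packages)
# ncdu -x /              # interactive per-directory breakdown of one filesystem
# snapper list           # check whether a single snapshot dominates usage
```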

Actions #12

Updated by openqa_review 12 months ago

  • Due date set to 2024-06-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by mkittler 12 months ago

  • Status changed from In Progress to Resolved

I extended the root filesystem (and all other required underlying layers) by 5 GiB. Now we're at 58.7 % utilization which is good enough.
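The comment does not spell out the resize chain, so this is a hypothetical sketch of the "underlying layers" involved: qemu-img on the VM host, growpart and an ext4 resize inside the VM are assumptions, and the device names and disk path are placeholders:

```shell
# On the VM host, with the VM shut off: grow the virtual disk by 5 GiB
# (disk path is a placeholder).
#   qemu-img resize /var/lib/libvirt/images/backup-vm.qcow2 +5G
# Inside the VM: grow the root partition to fill the disk (cloud-utils
# growpart), then the filesystem (resize2fs for ext4; for btrfs it would be
# "btrfs filesystem resize max /").
#   sudo growpart /dev/vda 1
#   sudo resize2fs /dev/vda1
# Verify the new utilization (the comment reports 58.7 % afterwards):
df -h /
```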
