action #160481 (closed)

backup-vm: partitions usage (%) alert & systemd services alert size:S

Added by tinita 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2024-05-17
Due date: 2024-06-06
% Done: 0%
Estimated time:

Description

Observation

Fri, 17 May 2024 04:01:33 +0200

1 firing alert instance

📁 GROUPED BY hostname=backup-vm

  🔥 1 firing instance

Firing [stats.openqa-monitor.qa.suse.de]
backup-vm: partitions usage (%) alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
A0=86.0003690373683 
Labels
alertname: backup-vm: partitions usage (%) alert
grafana_folder:

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/partitions_usage_alert_backup-vm/view?orgId=1

Also, possibly related:
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

Not related: this was about backup-qam, not backup-vm, which this ticket is about:

Failed systemd services
2024-05-16 15:27:50    backup-qam    check-for-kernel-crash, kdump-notify
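
For reference, failed units like the ones in the table above can be inspected directly on the affected host; this is a generic sketch (the unit names come from the dashboard entry for backup-qam, not from this ticket's backup-vm):

    systemctl --failed                                      # list all units currently in a failed state
    systemctl status check-for-kernel-crash kdump-notify    # show details for the two reported units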

Suggestions

  • Check partition usage and which component contributes the most space usage (see the sketch after this list)
  • Check what happened that we had this short high usage surge
  • Consider increasing the size of the virtually attached storage
  • Consider tweaking our backup rules to either include less data or keep a shorter retention. Not useful: it was the root partition that filled up (backups are on the separate partition /dev/vdb1).
  • Or maybe don't do anything if this only happened once and, based on investigation of the monitoring data, is not likely to happen again
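
A minimal sketch of how the first two suggestions could be checked on the host; these are generic commands, not taken from the ticket:

    df -h                                        # usage of all mounted partitions, including /dev/vdb1
    du -xsh /* 2>/dev/null | sort -rh | head     # largest top-level directories on the root filesystem
    snapper list                                 # check whether old snapshots hold significant space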
Actions #1

Updated by tinita 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 2 months ago

Actions #3

Updated by tinita 2 months ago

  • Subject changed from backup-vm: partitions usage (%) alert to backup-vm: partitions usage (%) alert & systemd services alert
  • Description updated (diff)
Actions #4

Updated by okurz about 2 months ago

  • Tags set to infra, alert, reactive work
Actions #5

Updated by livdywan about 2 months ago

  • Subject changed from backup-vm: partitions usage (%) alert & systemd services alert to backup-vm: partitions usage (%) alert & systemd services alert size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan about 2 months ago

  • Description updated (diff)
Actions #8

Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #10

Updated by mkittler about 2 months ago

  • Description updated (diff)
Actions #11

Updated by mkittler about 2 months ago

Looks like an update was going on at the time. Probably some snapper cleanup "fixed" the problem later. Considering the file system only reached 86.2 %, there was no real problem. Maybe we should just bump the threshold to 90 % because always expecting so much headroom seems a bit wasteful.
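
For illustration, an ad-hoc check mirroring the proposed condition; the real alert is defined in Grafana, and the 90 % here is only the suggested threshold, not the current configuration:

    usage=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
    [ "$usage" -le 90 ] || echo "root filesystem at ${usage} %, above the 90 % threshold"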

I freed up almost 500 MiB by uninstalling libLLVM7, libLLVM9, libLLVM11, libLLVM15, webkit2gtk-4_0-injected-bundles, webkit2gtk-4_1-injected-bundles, WebKitGTK-4.0-lang, gnome-online-accounts and gnome-online-accounts-lang. 500 MiB is not that much, of course. According to ncdu the root filesystem isn't really bloated, so I don't think there's much more to gain here. According to snapper list there's also not one big snapshot.
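
Roughly the cleanup and inspection steps described above, as they could be run on the host (package and tool names as mentioned in the comment):

    zypper remove libLLVM7 libLLVM9 libLLVM11 libLLVM15 \
      webkit2gtk-4_0-injected-bundles webkit2gtk-4_1-injected-bundles \
      WebKitGTK-4.0-lang gnome-online-accounts gnome-online-accounts-lang
    ncdu -x /        # interactively inspect what occupies the root filesystem
    snapper list     # verify that no single snapshot is unusually large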

Actions #12

Updated by openqa_review about 2 months ago

  • Due date set to 2024-06-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Resolved

I extended the root filesystem (and all other required underlying layers) by 5 GiB. Now we're at 58.7 % utilization, which is good enough.
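
For context, a rough sketch of what such a resize typically involves; the concrete layers (libvirt image path, partition number, btrfs root on /dev/vda3) are assumptions and not taken from this ticket:

    # on the VM host (with the VM shut down): grow the backing disk image by 5 GiB
    qemu-img resize /var/lib/libvirt/images/backup-vm.qcow2 +5G
    # in the guest: grow the partition and the filesystem on top of it
    growpart /dev/vda 3
    btrfs filesystem resize max /
    df -h /          # verify the new utilization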
