action #175791
coordination #161414 (closed): [epic] Improved salt based infrastructure management
[alert] storage: partitions usage (%) alert size:S
Description
Observation
Values
- A0 = 85.08932639307272

Labels
- alertname: storage: partitions usage (%) alert
- grafana_folder: Generic
- hostname: storage
- rule_uid: partitions_usage_alert_storage
- type: generic
So /dev/sda on the host storage is too full (85% full).
http://monitor.qa.suse.de/d/GDstorage?orgId=1&viewPanel=65090
Suggestions
- Clean up storage; the space is probably taken by the backup of the backup VM (see related ticket and the sketch after this list)
- Do not adjust the alert itself; it is perfectly fine
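A minimal sketch of how the usage could be inspected before cleaning up, assuming shell access to the storage host; it only reads data and does not prescribe what to delete:

```bash
# Overall usage of the filesystem behind the alert
df -h /storage

# Largest directories up to two levels below /storage, biggest last
du -xh --max-depth=2 /storage | sort -h | tail -n 15
```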
Rollback actions
- Remove the silence applied to the alert https://monitor.qa.suse.de/alerting/silences
Updated by jbaier_cz 3 months ago
- Copied from action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M added
Updated by okurz 3 months ago
- Related to action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Updated by gpathak 3 months ago · Edited
@okurz
I am planning to delete /storage/backup/backup-vm/ since it is a duplicate of /storage/rsnapshot/. /storage/rsnapshot/ always holds the latest up-to-date backup; if we choose to delete /storage/backup/backup-vm/, I will have to update https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md#backup-of-additional-services-running-on-qamaster accordingly.
What are your thoughts? Can we move /storage/backup/backup-vm/ to some other machine?
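A hedged sketch of how the claimed duplication could be checked before deleting anything; the rsync dry run changes nothing on disk, and daily.0 is only a guess at the name of the newest rsnapshot interval (the real one is defined in rsnapshot.conf):

```bash
# Compare the overall sizes of the two trees
du -sh /storage/backup/backup-vm/ /storage/rsnapshot/

# Dry run (-n): itemize what differs between backup-vm/ and the newest
# rsnapshot interval (daily.0 is a hypothetical directory name)
rsync -an --itemize-changes /storage/backup/backup-vm/ /storage/rsnapshot/daily.0/ | head
```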
Updated by gpathak 3 months ago
Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs
Updated by livdywan 3 months ago
- Status changed from Feedback to Resolved
gpathak wrote in #note-13:
Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs
Please remember that an Urgent ticket should not remain in Feedback. If I see this correctly it should be fixed, so let's resolve and re-open if there are any issues.
Updated by okurz 3 months ago
I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere, either by removing other data or by ordering additional storage hardware.
Updated by gpathak 3 months ago
okurz wrote in #note-19:
I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere, either by removing other data or by ordering additional storage hardware.
We cannot delete anything more from storage. Bringing back backup-vm/ will cause the Grafana alert to trigger again; we need to silence the alert until we have additional storage.
Updated by openqa_review 3 months ago
- Due date set to 2025-02-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpathak 2 months ago
- Copied to action #177766: Consider storage policy for storage.qe.prg2.suse.org size:S added
Updated by gpuliti about 2 months ago
I've created a silence for the alert starting today that lasts for 2 weeks.
2025-03-03 14:17 - 2025-03-16 23:59
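For reference, such a silence can also be created from the command line; this is only a sketch assuming amtool is pointed at an Alertmanager-compatible API behind monitor.qa.suse.de (the URL path and matcher values are assumptions, and the Grafana silences UI linked in the rollback steps achieves the same thing):

```bash
# Silence the partitions-usage alert on host "storage" for two weeks
amtool silence add \
  --alertmanager.url=https://monitor.qa.suse.de/alertmanager \
  --comment="poo#175791: waiting for additional storage" \
  --duration=2w \
  alertname="storage: partitions usage (%) alert" hostname=storage
```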
Updated by okurz about 1 month ago
- Status changed from Blocked to Workable
- Priority changed from Normal to Urgent
Multiple alerts from yesterday and today. Please consider mitigations and silences.
Updated by gpathak about 1 month ago
okurz wrote in #note-33:
Multiple alerts from yesterday and today. Please consider mitigations and silences.
The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.
Updated by gpathak about 1 month ago
gpathak wrote in #note-34:
okurz wrote in #note-33:
Multiple alerts from yesterday and today. Please consider mitigations and silences.
The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.
Three big directories consume around 1.14 TiB (see the sketch after the table):

| Directory | Size |
|---|---|
| /home/backup | 484 GiB |
| /home/okurz | 472 GiB |
| /home/nsinger | 191 GiB |
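A sketch of how such a breakdown can be produced on backup-vm; both commands are read-only:

```bash
# Overall usage of the filesystem holding /home
df -h /home

# Size of each top-level directory under /home, largest last
du -xsh /home/* | sort -h
```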
Updated by gpathak about 1 month ago
- Status changed from Workable to In Progress
- Priority changed from Urgent to High
Reducing the priority since the alert is resolved and I found some directories that can be cleaned up. We need to discuss this in or after the daily.
Updated by gpathak about 1 month ago · Edited
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
livdywan wrote in #note-28:
gpathak wrote in #note-25:
Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515
No response. Lowering prio to reflect reality.
Thanks to @nicksinger for cleaning up space in his home directory on backup-vm, the usage is now ~74.6%.
Lowering the priority and again blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515
Updated by gpathak 18 days ago
livdywan wrote in #note-28:
gpathak wrote in #note-25:
Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515
No response. Lowering prio to reflect reality.
Two SSDs of 7 TB each were installed today on the storage host.
Now my question is: what is the preferred way of merging/combining multiple storage devices (SSDs) into one?
Updated by gpathak 16 days ago
We also have to change the target disk device for the storage host in the Grafana monitoring alert rules if we choose to combine the three disks into one LVM group. /storage is the mount point for /dev/sda; when we combine /dev/sda, /dev/sdm and /dev/sdn, we will have to change the target device node in those rules from /dev/sda to the device node of the newly created LVM group.
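Purely for illustration, a sketch of the LVM variant considered here; vg_storage and lv_storage are made-up names, /dev/sda could only join the group after its data is migrated, and a different approach was taken in the end (see the following comments):

```bash
# Hypothetical LVM setup over the two new SSDs only (names are made up)
pvcreate /dev/sdm /dev/sdn
vgcreate vg_storage /dev/sdm /dev/sdn
lvcreate -l 100%FREE -n lv_storage vg_storage

# The Grafana alert rules would then have to target the mapper node,
# e.g. /dev/mapper/vg_storage-lv_storage, instead of /dev/sda.
```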
Updated by livdywan 16 days ago · Edited
- Status changed from Blocked to Feedback
Just to keep the ticket updated as we're discussing how to continue here. We now have /dev/sdm and /dev/sdn with 7 TB each and should decide how to use them. Hence not Blocked.
Updated by gpathak 15 days ago · Edited
- Status changed from Feedback to Resolved
@nicksinger pointed out that we use a btrfs array and suggested to simply run btrfs device add /dev/sdm /storage and btrfs device add /dev/sdn /storage. Later I used btrfs balance start -dusage=5 /storage/ to balance the data usage across all disk devices in the btrfs array.
So nothing special was needed here.
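The commands from this comment collected into one runnable sequence, with a read-only check at the end; device names are as mentioned above:

```bash
# Add both new 7 TB SSDs to the existing btrfs filesystem mounted at /storage
btrfs device add /dev/sdm /storage
btrfs device add /dev/sdn /storage

# Rebalance only data block groups that are at most 5% used, which moves
# data onto the new devices without a full (and expensive) rebalance
btrfs balance start -dusage=5 /storage/

# Check the result: per-device allocation and overall usage
btrfs filesystem usage /storage
```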
The storage host partition Grafana dashboard now shows 40.5% mean disk usage for /dev/sda:
https://monitor.qa.suse.de/d/GDstorage/dashboard-for-storage?viewPanel=panel-65090&orgId=1&from=2025-04-14T10:30:00.000Z&to=2025-04-14T10:48:00.000Z&timezone=browser&var-datasource=000000001&refresh=1m