action #175791

closed

coordination #161414: [epic] Improved salt based infrastructure management

[alert] storage: partitions usage (%) alert size:S

Added by jbaier_cz 3 months ago. Updated 15 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

Values
A0=85.08932639307272 
Labels
alertname     storage: partitions usage (%) alert
grafana_folder     Generic
hostname     storage
rule_uid     partitions_usage_alert_storage
type     generic

So sda on the host storage is too full (85 % full).

http://monitor.qa.suse.de/d/GDstorage?orgId=1&viewPanel=65090

Suggestions

  • Clean up storage; the space is probably taken by the backup of the backup VM (see related ticket)
  • Do not adjust the alert itself; it is perfectly fine

Rollback actions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S (Resolved, gpathak)

Copied from openQA Infrastructure (public) - action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M (Resolved, okurz, 2023-11-15)

Copied to openQA Infrastructure (public) - action #177766: Consider storage policy for storage.qe.prg2.suse.org size:S (Resolved, gpathak, 2025-02-24)

Actions #1

Updated by jbaier_cz 3 months ago

  • Copied from action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M added
Actions #2

Updated by gpathak 3 months ago

Related #173347

Actions #3

Updated by gpathak 3 months ago

gpathak wrote in #note-2:

Related #173347

Maybe we can get rid of /storage/backup/backup-vm/ as we have a continuous backup at /storage/rsnapshot/

Actions #4

Updated by okurz 3 months ago

  • Category set to Regressions/Crashes
Actions #5

Updated by okurz 3 months ago

  • Related to action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Actions #6

Updated by okurz 3 months ago

  • Parent task set to #161414
Actions #7

Updated by okurz 3 months ago

  • Priority changed from High to Urgent
  • Start date deleted (2023-11-15)

Repeated alert

Actions #8

Updated by gpathak 3 months ago

  • Assignee set to gpathak
Actions #9

Updated by gpathak 3 months ago

  • Status changed from New to In Progress
Actions #10

Updated by gpathak 3 months ago · Edited

@okurz
I am planning to delete /storage/backup/backup-vm/ since it is a duplicate of /storage/rsnapshot/
/storage/rsnapshot/ is always the latest up-to-date backup; I will have to update https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md#backup-of-additional-services-running-on-qamaster accordingly if we choose to delete /storage/backup/backup-vm/

What are your thoughts? Can we move /storage/backup/backup-vm/ to some other machine?

Actions #11

Updated by okurz 3 months ago

Ok, go ahead

Actions #12

Updated by okurz 3 months ago

  • Subject changed from [alert] storage: partitions usage (%) alert to [alert] storage: partitions usage (%) alert size:S
  • Description updated (diff)
Actions #13

Updated by gpathak 3 months ago

Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs

Actions #14

Updated by gpathak 3 months ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by livdywan 3 months ago

  • Status changed from Feedback to Resolved

gpathak wrote in #note-13:

Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs

Please remember an Urgent ticket should not remain in Feedback. If I see this correctly it should be fixed, so let's resolve and re-open if there are any issues.

Actions #16

Updated by gpuliti 3 months ago

I've approved the MR.

Actions #17

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable

@livdywan as I told you, our monitoring data tells us if we are done. Please check again.

Actions #18

Updated by gpathak 3 months ago · Edited

@okurz @livdywan
Deleting the backup of backup-vm under /storage/backup/backup-vm/ freed up around 222 GiB of data.
We need to check for old data to delete from storage if more disk space is needed. I will look into it later.
Since the alert from Grafana is resolved, maybe we can lower the priority.

Actions #19

Updated by okurz 3 months ago

I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere by either removing other data or ordering additional storage hardware.

Actions #20

Updated by livdywan 3 months ago

  • Status changed from Workable to In Progress
Actions #21

Updated by gpathak 3 months ago

okurz wrote in #note-19:

I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere by either removing other data or ordering additional storage hardware.

We cannot delete anything more from storage. Bringing back backup-vm/ will cause the Grafana alert to trigger again; we need to silence the alert until we have additional storage.

Actions #22

Updated by gpathak 3 months ago

  • Priority changed from Urgent to High

Reducing the priority from Urgent to High; the Grafana alert is resolved as of now.
We can change the priority again if needed.

Actions #23

Updated by openqa_review 3 months ago

  • Due date set to 2025-02-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #24

Updated by okurz 3 months ago

  • Status changed from In Progress to Workable
Actions #25

Updated by gpathak 3 months ago

  • Due date deleted (2025-02-06)
  • Status changed from Workable to Blocked
Actions #26

Updated by livdywan 3 months ago

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

Still pending. And @gpathak is proactive on the ticket, just didn't mention it here, which is why it hit our SLOs anyway.

Actions #27

Updated by livdywan 3 months ago

I escalated this in the meantime since we have no reliable response time with this and other similar requests.

Actions #28

Updated by livdywan 3 months ago

  • Priority changed from High to Normal

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Actions #30

Updated by gpathak 2 months ago

  • Copied to action #177766: Consider storage policy for storage.qe.prg2.suse.org size:S added
Actions #31

Updated by gpuliti about 2 months ago

I've created a silence for the alert starting today that lasts for 2 weeks.

2025-03-03 14:17 - 2025-03-16 23:59

Actions #32

Updated by gpuliti about 2 months ago

  • Description updated (diff)
Actions #33

Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable
  • Priority changed from Normal to Urgent

Multiple alerts from yesterday and today. Please consider mitigations and silences.

Actions #34

Updated by gpathak about 1 month ago

okurz wrote in #note-33:

Multiple alerts from yesterday and today. Please consider mitigations and silences.

The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.

Actions #35

Updated by gpathak about 1 month ago

gpathak wrote in #note-34:

okurz wrote in #note-33:

Multiple alerts from yesterday and today. Please consider mitigations and silences.

The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.

Three big directories consume around 1.14 TiB in total:

Directory        Size
/home/backup     484 GiB
/home/okurz      472 GiB
/home/nsinger    191 GiB
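
For reference, a breakdown like this can be gathered with something along these lines (a minimal sketch; run on the affected host, and the path and depth may need adjusting):

  # Per-directory usage directly under /home, largest first
  du -h --max-depth=1 /home | sort -hr | head
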
Actions #36

Updated by gpathak about 1 month ago

  • Status changed from Workable to In Progress
  • Priority changed from Urgent to High

Reducing priority since the alert is resolved; found some directories that can be cleaned up. Need to discuss this in or after the daily.

Actions #37

Updated by gpathak about 1 month ago · Edited

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

livdywan wrote in #note-28:

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Thanks to @nicksinger for cleaning up space in his home directory on backup-vm, the usage is now ~74.6%.
Lowering the priority and again blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

Actions #38

Updated by gpathak 18 days ago

livdywan wrote in #note-28:

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Two SSDs of 7 TB each were installed today on the storage host.
Now my question is: what is the preferred way of merging/combining multiple storage devices (SSDs) into one?

Actions #39

Updated by gpathak 16 days ago

We also have to change the target disk device for the storage host in the Grafana monitoring alert rules if we combine the three disks into one LVM group (a rough sketch of that approach follows after the list below).
/storage is the mount point for /dev/sda; when we combine /dev/sda, /dev/sdm and /dev/sdn, we'll have to change the target device node from /dev/sda to the newly created LVM device node for:

  1. Disk I/O Time Alert
  2. Partition Usage Alert
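
For context, the LVM approach would look roughly like this (a hypothetical sketch only; the volume group and logical volume names are made up, and the existing data on /dev/sda would first have to be migrated before that disk could join the group):

  # Hypothetical: combine the two new SSDs into one LVM logical volume
  pvcreate /dev/sdm /dev/sdn                        # mark the disks as LVM physical volumes
  vgcreate storage_vg /dev/sdm /dev/sdn             # create a volume group from them
  lvcreate -l 100%FREE -n storage_lv storage_vg     # one logical volume spanning the group
  mkfs.xfs /dev/storage_vg/storage_lv               # create a filesystem on the new volume
  # /dev/sda could be added later with pvcreate + vgextend once its data is moved off
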
Actions #40

Updated by livdywan 16 days ago · Edited

  • Status changed from Blocked to Feedback

Just to keep the ticket updated as we're discussing how to continue here. We have /dev/sdm and /dev/sdn now with 7 TB each and should decide how to use them. Hence not Blocked.

Actions #41

Updated by gpathak 15 days ago · Edited

  • Status changed from Feedback to Resolved

@nicksinger pointed out that we use a btrfs array and suggested simply using btrfs device add /dev/sdm /storage and btrfs device add /dev/sdn /storage
Later I used btrfs balance start -dusage=5 /storage/ to balance the data usage across all disk devices in the btrfs array.
So nothing special was needed here.
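
Putting the steps together, the sequence looked roughly like this (a sketch based on the commands above; the -dusage=5 filter value was an ad-hoc choice):

  # Add the two new 7 TB SSDs to the existing btrfs filesystem mounted at /storage
  btrfs device add /dev/sdm /storage
  btrfs device add /dev/sdn /storage
  # Rebalance data chunks that are less than 5% used so data spreads over all devices
  btrfs balance start -dusage=5 /storage
  # Check the resulting per-device allocation
  btrfs filesystem usage /storage
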

The storage host partition Grafana dashboard now shows 40.5% mean disk usage for /dev/sda: https://monitor.qa.suse.de/d/GDstorage/dashboard-for-storage?viewPanel=panel-65090&orgId=1&from=2025-04-14T10:30:00.000Z&to=2025-04-14T10:48:00.000Z&timezone=browser&var-datasource=000000001&refresh=1m
