action #175791

closed

coordination #161414: [epic] Improved salt based infrastructure management

[alert] storage: partitions usage (%) alert size:S

Added by jbaier_cz 3 months ago. Updated 15 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

Values
A0=85.08932639307272 
Labels
alertname     storage: partitions usage (%) alert
grafana_folder     Generic
hostname     storage
rule_uid     partitions_usage_alert_storage
type     generic

So sda on the host storage is too full (85 % full).

http://monitor.qa.suse.de/d/GDstorage?orgId=1&viewPanel=65090

Suggestions

  • Clean up storage; the space is probably taken by the backup of the backup VM (see related ticket)
  • Do not adjust the alert itself; it is perfectly fine

Rollback actions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S (Resolved, gpathak)

Copied from openQA Infrastructure (public) - action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M (Resolved, okurz, 2023-11-15)

Copied to openQA Infrastructure (public) - action #177766: Consider storage policy for storage.qe.prg2.suse.org size:S (Resolved, gpathak, 2025-02-24)

Actions #1

Updated by jbaier_cz 3 months ago

  • Copied from action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M added
Actions #2

Updated by gpathak 3 months ago

Related #173347

Actions #3

Updated by gpathak 3 months ago

gpathak wrote in #note-2:

Related #173347

Maybe we can get rid of /storage/backup/backup-vm/ as we have a continuous backup at /storage/rsnapshot/

Actions #4

Updated by okurz 3 months ago

  • Category set to Regressions/Crashes
Actions #5

Updated by okurz 3 months ago

  • Related to action #173347: Ensure we have a current backup of qamaster VMs, VM config, jenkins data, data from backup-vm itself, etc. size:S added
Actions #6

Updated by okurz 3 months ago

  • Parent task set to #161414
Actions #7

Updated by okurz 3 months ago

  • Priority changed from High to Urgent
  • Start date deleted (2023-11-15)

Repeated alert

Actions #8

Updated by gpathak 3 months ago

  • Assignee set to gpathak
Actions #9

Updated by gpathak 3 months ago

  • Status changed from New to In Progress
Actions #10

Updated by gpathak 3 months ago · Edited

@okurz
I am planning to delete /storage/backup/backup-vm/ since it is a duplicate of /storage/rsnapshot/
/storage/rsnapshot/ is always the latest up-to-date backup; I will have to update https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md#backup-of-additional-services-running-on-qamaster accordingly if we choose to delete /storage/backup/backup-vm/

What are your thoughts? Can we move /storage/backup/backup-vm/ to some other machine?

Actions #11

Updated by okurz 3 months ago

Ok, go ahead

Actions #12

Updated by okurz 3 months ago

  • Subject changed from [alert] storage: partitions usage (%) alert to [alert] storage: partitions usage (%) alert size:S
  • Description updated (diff)
Actions #13

Updated by gpathak 3 months ago

Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs

Actions #14

Updated by gpathak 3 months ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by livdywan 3 months ago

  • Status changed from Feedback to Resolved

gpathak wrote in #note-13:

Cleaned up /storage/backup/backup-vm/ and created MR https://gitlab.suse.de/suse/wiki/-/merge_requests/8/diffs

Please remember an Urgent ticket should not remain in Feedback. If I see this correctly it should be fixed, so let's resolve and re-open if there are any issues.

Actions #16

Updated by gpuliti 3 months ago

I've approved the MR.

Actions #17

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable

@livdywan as I told you, our monitoring data tells us if we are done. Please check again.

Actions #18

Updated by gpathak 3 months ago · Edited

@okurz @livdywan
Deleting the backup of backup-vm under /storage/backup/backup-vm/ freed up around 222 GiB of data.
We need to check for old data to delete from storage if more disk space is needed. I will look into it later.
Since the alert from Grafana is resolved, maybe we can lower the priority.

Actions #19

Updated by okurz 3 months ago

I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere by either removing other data or ordering additional storage hardware.

Actions #20

Updated by livdywan 3 months ago

  • Status changed from Workable to In Progress
Actions #21

Updated by gpathak 3 months ago

okurz wrote in #note-19:

I think I misunderstood your proposal to delete backup-vm/. I assumed you had an additional copy of backup-data.qcow2. Deleting backup-vm/ is in conflict with #173347. I suggest bringing back backup-vm/ and finding more space elsewhere by either removing other data or ordering additional storage hardware.

We cannot delete anything more from storage. Bringing back backup-vm/ will cause the Grafana alert to trigger again; we need to silence the alert until we have additional storage.

Actions #22

Updated by gpathak 3 months ago

  • Priority changed from Urgent to High

Reducing the priority from Urgent to High; the Grafana alert is resolved as of now.
We can change the priority again if needed.

Actions #23

Updated by openqa_review 3 months ago

  • Due date set to 2025-02-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #24

Updated by okurz 3 months ago

  • Status changed from In Progress to Workable
Actions #25

Updated by gpathak 3 months ago

  • Due date deleted (2025-02-06)
  • Status changed from Workable to Blocked
Actions #26

Updated by livdywan 3 months ago

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

Still pending. And @gpathak is proactive on the ticket, just didn't mention it here, which is why it hit our SLOs anyway.

Actions #27

Updated by livdywan 3 months ago

I escalated this in the meantime since we have no reliable response time with this and other similar requests.

Actions #28

Updated by livdywan 3 months ago

  • Priority changed from High to Normal

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Actions #30

Updated by gpathak 2 months ago

  • Copied to action #177766: Consider storage policy for storage.qe.prg2.suse.org size:S added
Actions #31

Updated by gpuliti about 2 months ago

I've created a silence for the alert starting today that lasts for 2 weeks.

2025-03-03 14:17 - 2025-03-16 23:59

Actions #32

Updated by gpuliti about 2 months ago

  • Description updated (diff)
Actions #33

Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable
  • Priority changed from Normal to Urgent

Multiple alerts from yesterday and today. Please consider mitigations and silences.

Actions #34

Updated by gpathak about 1 month ago

okurz wrote in #note-33:

Multiple alerts from yesterday and today. Please consider mitigations and silences.

The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.

Actions #35

Updated by gpathak about 1 month ago

gpathak wrote in #note-34:

okurz wrote in #note-33:

Multiple alerts from yesterday and today. Please consider mitigations and silences.

The alerts were related to backup-vm. I freed some space on backup-vm as a mitigation, but we need to free up more space, mainly from the /home directory.

Three big directories consume around 1.14 TiB in total:

Directory        Size
/home/backup     484 GiB
/home/okurz      472 GiB
/home/nsinger    191 GiB
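
For reference, a breakdown like this can be gathered with something along these lines (a minimal sketch; run on the affected host, and the path and depth may need adjusting):

  # Per-directory usage directly under /home, largest first
  du -h --max-depth=1 /home | sort -hr | head
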
Actions #36

Updated by gpathak about 1 month ago

  • Status changed from Workable to In Progress
  • Priority changed from Urgent to High

Reducing priority since the alert is resolved; found some directories that can be cleaned up. Need to discuss this in or after the daily.

Actions #37

Updated by gpathak about 1 month ago · Edited

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

livdywan wrote in #note-28:

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Thanks to @nicksinger for cleaning up space in his home directory on backup-vm, the usage is now ~74.6%.
Lowering the priority and again blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

Actions #38

Updated by gpathak 18 days ago

livdywan wrote in #note-28:

gpathak wrote in #note-25:

Blocking on https://sd.suse.com/servicedesk/customer/portal/1/SD-178515

No response. Lowering prio to reflect reality.

Two SSDs of 7 TB each were installed today on the storage host.
Now my question is: what is the preferred way of merging/combining multiple storage devices (SSDs) into one?

Actions #39

Updated by gpathak 16 days ago

We also have to change the target disk device for the storage host in the Grafana monitoring alert rules if we combine the three disks into one LVM group (a rough sketch of that approach follows after the list below).
/storage is the mount point for /dev/sda; when we combine /dev/sda, /dev/sdm and /dev/sdn, we'll have to change the target device node from /dev/sda to the newly created LVM device node for:

  1. Disk I/O Time Alert
  2. Partition Usage Alert
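
For context, the LVM approach would look roughly like this (a hypothetical sketch only; the volume group and logical volume names are made up, and the existing data on /dev/sda would first have to be migrated before that disk could join the group):

  # Hypothetical: combine the two new SSDs into one LVM logical volume
  pvcreate /dev/sdm /dev/sdn                        # mark the disks as LVM physical volumes
  vgcreate storage_vg /dev/sdm /dev/sdn             # create a volume group from them
  lvcreate -l 100%FREE -n storage_lv storage_vg     # one logical volume spanning the group
  mkfs.xfs /dev/storage_vg/storage_lv               # create a filesystem on the new volume
  # /dev/sda could be added later with pvcreate + vgextend once its data is moved off
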
Actions #40

Updated by livdywan 16 days ago · Edited

  • Status changed from Blocked to Feedback

Just to keep the ticket updated as we're discussing how to continue here. We have /dev/sdm and /dev/sdn now with 7 TB each and should decide how to use them. Hence not Blocked.

Actions #41

Updated by gpathak 15 days ago · Edited

  • Status changed from Feedback to Resolved

@nicksinger pointed out that we use a btrfs array and suggested simply using btrfs device add /dev/sdm /storage and btrfs device add /dev/sdn /storage
Later I used btrfs balance start -dusage=5 /storage/ to balance the data usage across all disk devices in the btrfs array.
So nothing special was needed here.
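
Putting the steps together, the sequence looked roughly like this (a sketch based on the commands above; the -dusage=5 filter value was an ad-hoc choice):

  # Add the two new 7 TB SSDs to the existing btrfs filesystem mounted at /storage
  btrfs device add /dev/sdm /storage
  btrfs device add /dev/sdn /storage
  # Rebalance data chunks that are less than 5% used so data spreads over all devices
  btrfs balance start -dusage=5 /storage
  # Check the resulting per-device allocation
  btrfs filesystem usage /storage
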

The storage host partition Grafana dashboard now shows 40.5% mean disk usage for /dev/sda: https://monitor.qa.suse.de/d/GDstorage/dashboard-for-storage?viewPanel=panel-65090&orgId=1&from=2025-04-14T10:30:00.000Z&to=2025-04-14T10:48:00.000Z&timezone=browser&var-datasource=000000001&refresh=1m
