action #177766


coordination #161414: [epic] Improved salt based infrastructure management

Consider storage policy for storage.qe.prg2.suse.org size:S

Added by gpathak 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee: gpathak
Category: Organisational
Start date: 2025-02-24
Due date:
% Done: 0%
Estimated time:
Description

Motivation

We keep having to resolve the storage host alert for usage above 85%, and each time we scratch our heads over which data to delete.
Instead we should come up with a data backup and retention policy for OSD, and if possible for O3 as well, so that we never have to worry about running out of space for automatic data backups, except under unavoidable circumstances.

Acceptance Criteria

  • AC1: Discuss within the tools team a backup and retention policy and come up with an optimal backup proposal (keeping the motivation in mind)
  • AC2: Discuss and present the proposal to other teams to bring everyone on the same page; if required, re-iterate the proposal from AC1
  • AC3: Clean up old assets/data/logs from OSD and, if required, from O3 as well; implement the proposal (approved in AC2)

Suggestions

  • Ask on Slack in #eng-testing and if people don't speak up it's their fault
  • Save fewer snapshots
  • Exclude certain data
  • Enter filenames of old assets in the search at https://openqa.suse.de/admin/assets and remove them if they're not used anymore (see the sketch after this list)
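
For the asset-search suggestion above, a small helper can list candidate assets before reviewing and removing them in the web UI. This is only a sketch: it assumes the instance exposes the usual GET /api/v1/assets route and that the returned entries carry id, name, type and size fields; the instance URL and search pattern are placeholders.

    # Sketch: list openQA assets whose file name contains a given substring so they
    # can be reviewed (and, if unused, deleted) via https://openqa.suse.de/admin/assets.
    import sys
    import requests

    OPENQA = "https://openqa.suse.de"  # placeholder; use https://openqa.opensuse.org for O3

    def find_assets(pattern):
        resp = requests.get(f"{OPENQA}/api/v1/assets", timeout=60)
        resp.raise_for_status()
        for asset in resp.json().get("assets", []):
            name = asset.get("name", "")
            if pattern in name:
                size_gib = (asset.get("size") or 0) / 1024**3
                print(f"{asset.get('id')}\t{asset.get('type')}\t{size_gib:6.1f} GiB\t{name}")

    if __name__ == "__main__":
        find_assets(sys.argv[1])

Actual deletion is better done through the admin UI (or the assets API with proper credentials) only after confirming the asset is no longer referenced by any job group.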

Further details

storage.qe.prg2.suse.org via rsnapshot in /home/rsnapshot
  - backup of openQA data (test result files without assets - "test result archive" - e.g. screenshots, video, serial log)
    - archive
    - fixed isos
    - fixed hdd images

backup-vm via rsnapshot in /home/rsnapshot
  - osd database + /etc
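
To get a quick overview of which of the areas above consumes the most space on the storage host, something like the following could be run there. This is only a sketch under the assumption that all rsnapshot data lives beneath /home/rsnapshot as described; note that separate du invocations each count hardlinked snapshot data, so the per-directory numbers overlap rather than add up.

    # Sketch: print the disk usage of each top-level directory under
    # /home/rsnapshot on storage.qe.prg2.suse.org, largest first.
    import subprocess
    from pathlib import Path

    ROOT = Path("/home/rsnapshot")  # backup location taken from the description above

    def usage_overview():
        sizes = []
        for entry in sorted(ROOT.iterdir()):
            if not entry.is_dir():
                continue
            out = subprocess.run(["du", "-sb", str(entry)],
                                 capture_output=True, text=True).stdout
            if out:
                sizes.append((int(out.split()[0]), entry.name))
        for size, name in sorted(sizes, reverse=True):
            print(f"{size / 1024**3:10.1f} GiB  {name}")

    if __name__ == "__main__":
        usage_overview()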

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #175791: [alert] storage: partitions usage (%) alert size:S (Resolved, assignee: gpathak)

Actions
Actions #1

Updated by gpathak 2 months ago

  • Copied from action #175791: [alert] storage: partitions usage (%) alert size:S added
Actions #2

Updated by dheidler 2 months ago

  • Subject changed from Consider storage policy for storage.qe.prg2.suse.org to Consider storage policy for storage.qe.prg2.suse.org [size:S]
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 2 months ago

  • Subject changed from Consider storage policy for storage.qe.prg2.suse.org [size:S] to Consider storage policy for storage.qe.prg2.suse.org size:S
Actions #4

Updated by gpathak about 1 month ago

  • Assignee set to gpathak
Actions #5

Updated by gpathak about 1 month ago

  • Status changed from Workable to In Progress
Actions #7

Updated by openqa_review about 1 month ago

  • Due date set to 2025-04-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by gpathak about 1 month ago

I will look into O3 assets as well.

Actions #10

Updated by gpathak about 1 month ago

I have deleted two hdd files from O3:

We still have backups of these files on our storage host, but they will be removed from the storage backup after approximately 4 months.

@dheidler Can we reduce the number of rsnapshot backups?
Right now we have 3 backups of alpha and beta and 2 for gamma.
How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way, less storage would be used.
Maybe we can revert this once #175791 is resolved, or continue with the above proposal if we still have enough backups even with the reduced number of rsnapshot levels.

@okurz @livdywan Any thoughts?

Actions #11

Updated by okurz about 1 month ago

gpathak wrote in #note-10:

> I have deleted two hdd files from O3:
>
> We still have backups of these files on our storage host, but they will be removed from the storage backup after approximately 4 months.
>
> @dheidler Can we reduce the number of rsnapshot backups?
> Right now we have 3 backups of alpha and beta and 2 for gamma.
> How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way, less storage would be used.

How much less storage would that use?

> Maybe we can revert this once #175791 is resolved, or continue with the above proposal if we still have enough backups even with the reduced number of rsnapshot levels.

Agreed. This can be a temporary mitigation to ensure we don't run out of storage space, and it should be reverted once more storage is fitted into the systems.

Actions #12

Updated by gpathak about 1 month ago

okurz wrote in #note-11:

> gpathak wrote in #note-10:
>
> > I have deleted two hdd files from O3:
> >
> > We still have backups of these files on our storage host, but they will be removed from the storage backup after approximately 4 months.
> >
> > @dheidler Can we reduce the number of rsnapshot backups?
> > Right now we have 3 backups of alpha and beta and 2 for gamma.
> > How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way, less storage would be used.
>
> How much less storage would that use?

It would use ~460 GiB less if we keep 2 alpha snapshots instead of 3.
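
For reference, one way to arrive at such a number: measure how much data in the oldest alpha snapshot is not hardlinked into the newer ones, since that is roughly what removing it would free (rsnapshot shares unchanged files between snapshots via hardlinks). A sketch, assuming the snapshots live under /home/rsnapshot with rsnapshot's usual alpha.0/alpha.1/alpha.2 naming:

    # Sketch: estimate the space freed by deleting the oldest snapshot, by summing
    # files whose inodes do not also appear in the newer snapshots.
    import os
    from pathlib import Path

    ROOT = Path("/home/rsnapshot")

    def inodes_under(path):
        seen = set()
        for dirpath, _dirs, files in os.walk(path):
            for name in files:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue
                seen.add((st.st_dev, st.st_ino))
        return seen

    def freed_by_removing(oldest, newer):
        kept = set()
        for snap in newer:
            kept |= inodes_under(ROOT / snap)
        freed, counted = 0, set()
        for dirpath, _dirs, files in os.walk(ROOT / oldest):
            for name in files:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue
                key = (st.st_dev, st.st_ino)
                if key in kept or key in counted:
                    continue
                counted.add(key)
                freed += st.st_blocks * 512  # allocated bytes on disk
        return freed

    if __name__ == "__main__":
        freed = freed_by_removing("alpha.2", ["alpha.0", "alpha.1"])
        print(f"removing alpha.2 would free about {freed / 1024**3:.0f} GiB")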

Actions #13

Updated by gpathak about 1 month ago

gpathak wrote in #note-12:

> okurz wrote in #note-11:
>
> > gpathak wrote in #note-10:
> >
> > > I have deleted two hdd files from O3:
> > >
> > > We still have backups of these files on our storage host, but they will be removed from the storage backup after approximately 4 months.
> > >
> > > @dheidler Can we reduce the number of rsnapshot backups?
> > > Right now we have 3 backups of alpha and beta and 2 for gamma.
> > > How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way, less storage would be used.
> >
> > How much less storage would that use?
>
> It would use ~460 GiB less if we keep 2 alpha snapshots instead of 3.

Created an MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1411 reducing each snapshot level by 1, so it should use ~1.3 TB (460 GiB x 3) less space.

Actions #14

Updated by gpathak about 1 month ago · Edited

gpathak wrote in #note-13:

> Created an MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1411 reducing each snapshot level by 1, so it should use ~1.3 TB (460 GiB x 3) less space.

We changed the approach and decided to disable use_lazy_delete instead (with lazy deletes enabled, rsnapshot moves the rotated-out oldest snapshot to a _delete.[processid] directory and removes it only later, so that copy keeps occupying space in the meantime). The MR is merged, and I have also removed the existing single _delete* directory. Usage is now at 71%, see the storage host disk usage dashboard on Grafana.
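
A quick way to double-check the resulting usage directly on the storage host (the same figure the Grafana panel shows), assuming the backups live on the /home filesystem:

    # Sketch: print the usage percentage of the backup filesystem on the storage host.
    import shutil

    MOUNT = "/home"  # assumption: /home/rsnapshot lives on this filesystem

    total, used, free = shutil.disk_usage(MOUNT)
    print(f"{MOUNT}: {used / total:.0%} used "
          f"({used / 1024**4:.2f} of {total / 1024**4:.2f} TiB, {free / 1024**4:.2f} TiB free)")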

Actions #15

Updated by gpathak about 1 month ago

  • Status changed from In Progress to Resolved
Actions #16

Updated by okurz about 1 month ago

  • Due date deleted (2025-04-01)
