action #177766
coordination #161414 (closed): [epic] Improved salt based infrastructure management
Consider storage policy for storage.qe.prg2.suse.org size:S
Description
Motivation
We repeatedly have to resolve the storage host alert about usage exceeding 85%, and each time we are left scratching our heads over which data to delete.
Instead, we should come up with a data backup and retention policy for OSD, and if possible for O3 as well, so that we never have to worry about running low on storage space for automatic data backups, barring unavoidable circumstances.
Acceptance Criteria
- AC1: We have a properly documented backup and retention policy for storage.qe.prg2.suse.org (e.g. on qe-infra wiki page https://gitlab.suse.de/suse/wiki/-/blob/main/qe_infrastructure.md)
- AC2: We don't get the regular alert that the storage is running full
- AC3: We still keep data that is relevant for users
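AC2 can be spot-checked by hand. The following is a minimal sketch, assuming the 85% alert threshold mentioned in the motivation; the function names and the example path are hypothetical and not part of the actual alerting setup. The percentage is computed as used/total, which can differ slightly from what `df` or the Grafana alert reports.

```python
import shutil

# Assumption: threshold taken from the motivation of this ticket.
ALERT_THRESHOLD_PERCENT = 85.0


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_over_threshold(path: str, threshold: float = ALERT_THRESHOLD_PERCENT) -> bool:
    """True if the filesystem holding `path` is fuller than `threshold` percent."""
    return usage_percent(path) > threshold


if __name__ == "__main__":
    # "/" is only a placeholder; on the storage host one would pass the backup partition.
    print(f"{usage_percent('/'):.1f}% used, over threshold: {is_over_threshold('/')}")
```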
Suggestions
- Ask on Slack in #eng-testing; if people don't speak up, it's their fault
- Save fewer snapshots
- exclude certain data
- enter filenames of old assets at the search at https://openqa.suse.de/admin/assets and remove them if they're not used anymore
- Discuss the backup and retention policy within the tools team and come up with an optimal backup proposal (keeping the motivation in mind)
- Discuss and present the proposal to other teams to get everyone on the same page; if required, iterate on the proposal from AC1
- Clean up old assets/data/logs from OSD and, if required, from O3 as well; implement the proposal (approved from AC2)
Further details
storage.qe.prg2.suse.org via rsnapshot in /home/rsnapshot
- backup of openqa data (test result files without assets - "test result archive" - e.g. screenshots, video, serial log)
- archive
- fixed isos
- fixed hdd images
backup-vm via rsnapshot in /home/rsnapshot
- osd database + /etc
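To decide which snapshots are worth dropping, it helps to know how much space each one actually adds. rsnapshot hard-links unchanged files between snapshots, so summing file sizes per directory overcounts. The sketch below is one way to estimate it, assuming the snapshot directories (e.g. alpha.0, alpha.1) sit directly under the given root; the function name is hypothetical, and it counts apparent file sizes rather than allocated blocks.

```python
import os
from typing import Dict, Set, Tuple


def snapshot_sizes(root: str) -> Dict[str, int]:
    """Map each snapshot directory under `root` to the bytes it adds on disk.

    Files are identified by (device, inode). A hard-linked file is charged only
    to the first snapshot (in sorted name order) in which it appears.
    """
    seen: Set[Tuple[int, int]] = set()
    sizes: Dict[str, int] = {}
    for name in sorted(os.listdir(root)):
        top = os.path.join(root, name)
        if not os.path.isdir(top) or os.path.islink(top):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(top):
            for filename in filenames:
                st = os.lstat(os.path.join(dirpath, filename))
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # already charged to an earlier snapshot
                seen.add(key)
                total += st.st_size
        sizes[name] = total
    return sizes
```

Called as `snapshot_sizes("/home/rsnapshot")`, this gives a rough per-snapshot breakdown similar in spirit to running `du` over all snapshot directories in one invocation.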
Updated by gpathak 2 months ago
- Copied from action #175791: [alert] storage: partitions usage (%) alert size:S added
Updated by gpathak about 1 month ago
- Status changed from Workable to In Progress
Updated by openqa_review about 1 month ago
- Due date set to 2025-04-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpathak about 1 month ago
I have deleted two hdd files from O3:
We still have a backup of these files on our storage host, but it will be removed from the storage backup after approximately 4 months.
@dheidler Can we reduce the number of rsync backups?
Right now we have 3 backups of alpha and beta and 2 for gamma.
How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way less storage can be used.
Maybe we can revert this once #175791 is resolved, or continue with the above proposal if we still have enough backups even after reducing the number of snapshots per rsnapshot level.
Updated by okurz about 1 month ago
gpathak wrote in #note-10:
How about reducing each level by 1 to have 2 backups of alpha and beta and 1 for gamma? This way less storage can be used.
How much less storage would that use?
Maybe we can revert this once #175791 is resolved, or continue with the above proposal if we still have enough backups even after reducing the number of snapshots per rsnapshot level.
Agreed. This can be a temporary mitigation to ensure we don't run out of storage space and should be reverted once more storage is fitted into the systems.
Updated by gpathak about 1 month ago
okurz wrote in #note-11:
How much less storage would that use?
It would use ~460 GiB less if we keep 2 alpha snapshots instead of 3.
Updated by gpathak about 1 month ago
gpathak wrote in #note-12:
How much less storage would that use?
It would use ~460 GiB less if we keep 2 alpha snapshots instead of 3.
Created an MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1411 which reduces the snapshot count of each level by 1, so it should use ~1.3 TiB (460 GiB x 3) less space.
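For reference, the estimate can be reproduced as below. This is only a sketch of the arithmetic: it assumes, as the estimate above does, that dropping one snapshot from each of the three levels saves the same ~460 GiB that was measured for one alpha snapshot, which may not hold for beta and gamma. The variable names are illustrative.

```python
GIB = 2**30
TIB = 2**40
TB = 10**12

# Retained snapshots per level, as described in this ticket.
retain_before = {"alpha": 3, "beta": 3, "gamma": 2}
retain_after = {level: count - 1 for level, count in retain_before.items()}

# Assumption: every dropped snapshot frees about as much as one alpha snapshot.
saving_per_snapshot_gib = 460
dropped = sum(retain_before.values()) - sum(retain_after.values())

total_gib = saving_per_snapshot_gib * dropped
total_bytes = total_gib * GIB

print(retain_after)                    # {'alpha': 2, 'beta': 2, 'gamma': 1}
print(total_gib)                       # 1380
print(round(total_bytes / TIB, 2))     # 1.35 (TiB)
print(round(total_bytes / TB, 2))      # 1.48 (TB, decimal)
```

So the "~1.3" figure matches when read in TiB; in decimal TB the same amount is closer to 1.5.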
Updated by gpathak about 1 month ago · Edited
gpathak wrote in #note-13:
Created an MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1411 which reduces the snapshot count of each level by 1, so it should use ~1.3 TiB (460 GiB x 3) less space.
We changed the approach and decided to disable use_lazy_delete instead. The MR is merged, and I have also removed the single existing _delete* directory. Usage is now at 71% (see the storage host disk usage Grafana dashboard).
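To check whether any lazily deleted directories are still lying around, something like the following can be used. It is a minimal sketch, assuming the leftovers sit directly under the snapshot root (/home/rsnapshot in this ticket) and have names starting with `_delete`; the function name is hypothetical. It only lists candidates and deletes nothing. Note that upstream rsnapshot documentation spells the option `use_lazy_deletes`.

```python
from pathlib import Path
from typing import List


def find_lazy_delete_dirs(snapshot_root: str) -> List[Path]:
    """Return directories directly under `snapshot_root` whose name starts with `_delete`."""
    root = Path(snapshot_root)
    return sorted(p for p in root.glob("_delete*") if p.is_dir())


if __name__ == "__main__":
    # Path taken from the "Further details" section of this ticket.
    for leftover in find_lazy_delete_dirs("/home/rsnapshot"):
        print(leftover)
```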
Updated by gpathak about 1 month ago
- Status changed from In Progress to Resolved