coordination #76984
coordination #103950: [saga][epic] Scale up: Efficient handling of large storage for multiple independant projects and products
[epic] Automatically remove assets+results based on available free space
33%
Description
Motivation
See examples like #76822: openQA has automatic removal of assets+results, but the sum of all configured retention periods and asset quotas can still exceed the available space, so manual administration is required. If the cleanup based on these parameters cannot free enough space, we should take the next step and remove more until we have enough free space again. We already do something similar in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18 to remove videos of older test jobs, which we identified as a big contributor to space usage.
Acceptance criteria
- AC1: the filesystem including the openQA results directory is ensured to have at least a configured amount of free space
Suggestions
- Read and understand https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18
- Extend the existing asset+result cleanup to
  - check the free space of the filesystem including the assets/results directory
  - compare the free space against a configured value, e.g. in openqa.ini
  - if free space is below limit after results cleanup remove more data from results checking in each step until free space limit is reached, e.g.
    - videos from oldest, non-important jobs first ("oldest first" can mean simply job id numbers ascending order)
    - other results from oldest, non-important jobs
    - videos from oldest, important jobs
    - other results from oldest, important jobs
  - if after all steps free space limit could still not be reached, i.e. if all result data was removed, raise error
  - the above can be configured as well, e.g. "results_free_space_cleanup_components=non-important-results-videos,non-important-results-other,important-results-videos,important-results-other"
- can use https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df
- can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"
- Optional: Extend to assets as well
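The stepwise cleanup suggested above could be sketched roughly as follows. This is only an illustration in Python, not openQA code: the function `cleanup_until_free` and its callback parameters are hypothetical, the component names are taken from the example setting above, and the free-space check is passed in as a callable so that "df" can be mocked in tests as proposed.

```python
from typing import Callable, Iterable, Sequence

# Component names taken from the example setting in the suggestions above.
DEFAULT_COMPONENTS = (
    "non-important-results-videos",
    "non-important-results-other",
    "important-results-videos",
    "important-results-other",
)


def cleanup_until_free(
    free_percentage: Callable[[], float],
    min_free_percentage: float,
    candidates: Callable[[str], Iterable[int]],
    remove: Callable[[str, int], None],
    components: Sequence[str] = DEFAULT_COMPONENTS,
) -> bool:
    """Remove data component by component until enough space is free.

    All parameters are hypothetical hooks: free_percentage stands in for
    "df", candidates returns the job ids that still have data for a
    component, and remove deletes that data for one job.
    """
    for component in components:
        # "Oldest first" is approximated by ascending job id.
        for job_id in sorted(candidates(component)):
            # Re-check the free space before every single removal.
            if free_percentage() >= min_free_percentage:
                return True
            remove(component, job_id)
    if free_percentage() >= min_free_percentage:
        return True
    # Everything removable is gone and the limit is still not reached.
    raise RuntimeError("free space limit not reached after removing all result data")
```

In a test, `free_percentage` can simply be a lambda returning a fixed or stepwise increasing value, which corresponds to the "enough free space available" and "free space exceeded" cases mentioned above.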
Impact
This can also greatly help us as administrators of osd to ensure that /results limits are not exceeded which repeatedly caused us additional administration work.
Workaround
Have a periodic job calling "df" and checking against limit, remove results otherwise
Updated by okurz about 4 years ago
- Copied from action #76822: Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents) added
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 4 years ago
- Related to coordination #64881: [epic] Reconsider triggering cleanup jobs added
Updated by mkittler almost 4 years ago
- Related to action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) added
Updated by mkittler almost 4 years ago
1. Filesys::Df would be very simple to use: https://metacpan.org/pod/Filesys::Df
2. Regarding result cleanup
   1. We would likely want to run this after the regular cleanup. That's after all limit_screenshots tasks are done. So I'd add an additional Minion task and would enqueue such a job to run after all limit_screenshots jobs via the parents argument of enqueue (https://metacpan.org/pod/Minion#enqueue1).
   2. After cleaning up a job we need to check whether we are now beyond the configured limit in order to decide whether we need to proceed and clean up more jobs. The problem is that the screenshots associated with the deleted job are not immediately deleted when deleting the job. Taking care of dangling screenshots is so far only implemented as a separate task (the one mentioned in the previous point) which considers all screenshots. It looks like we need a 2nd way to clean up screenshots which would not try to do it in batches for all screenshots but only considers the screenshots related to a certain job. Not sure how efficient that would be but maybe it would be ok to run that every time after cleaning up a job while exceeding the free space. Making a query for the screenshots exclusively used by a certain job wouldn't be very difficult but it might be expensive to run, especially since we would possibly need to run it quite often.
3. Regarding asset cleanup
   1. We actually already have size limits but over-allocate in practice. So I assume this ticket is about combining the possibility of an over-allocation with a cleanup that ensures we do not actually run out of disk space.
   2. The previous point still leaves the question which assets should be deleted first. Maybe a 2nd asset cleanup run should be performed after the regular asset cleanup.
      1. It would use a scaled-down version of the configured quotas. With scaled-down I mean the absolute sizes of each group would be reduced to fit some limit but the proportions would be preserved. So the configured quotas would only serve as a weight factor for the 2nd cleanup.
      2. It would stop immediately if the threshold for the disk utilization is no longer exceeded. So assets are not needlessly removed.
   3. I hope that the combination of the previous points 1. and 2. allows that certain groups can still retain their over-allocated assets as long as enough other groups don't actually use their allocated limit.
   4. Maybe it makes sense to visualize the scaled-down limits first within openQA's asset statistics.
   5. It would also be nice to be able to perform a dry-run (with production data) before introducing changes like this.
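To illustrate the "scaled-down quotas" idea for the 2nd asset cleanup run, here is a minimal sketch in Python. The function name `scale_quotas`, the group names and the numbers are made up for illustration; only the proportional scaling itself comes from the comment above.

```python
from typing import Dict


def scale_quotas(quotas_gib: Dict[str, float], budget_gib: float) -> Dict[str, float]:
    """Shrink per-group quotas so that their sum fits into budget_gib.

    The proportions between groups are preserved, so the configured
    quotas effectively act as weight factors. If the quotas already fit,
    they are returned unchanged.
    """
    total = sum(quotas_gib.values())
    if total <= budget_gib:
        return dict(quotas_gib)
    factor = budget_gib / total
    return {group: size * factor for group, size in quotas_gib.items()}


# Hypothetical example: 400 GiB configured in total, but only 200 GiB available.
print(scale_quotas({"group-a": 100, "group-b": 300}, 200))
```

With these made-up numbers the scaling factor is 0.5, so the two groups end up with 50 GiB and 150 GiB respectively and keep their 1:3 ratio.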
Updated by openqa_review almost 4 years ago
- Due date set to 2020-12-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler almost 4 years ago
Here a few queries related to the screenshots-to-job mapping in our database which can help with point 2.2 from my previous comment:
number of screenshots per jobs:
openqa-local=> select job_id, count(distinct screenshot_id) as screenshot_count from screenshot_links where job_id = 1801 group by job_id;
number of jobs referencing a screenshot:
select count(distinct job_id) as screenshot_usage from screenshot_links where screenshot_id = 242820;
exclusive screenshots per job:
select distinct screenshot_id from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 1801 and (select count(job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) = 0;
shared screenshots per job:
select distinct screenshot_id, (select count(distinct job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) as spread from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 1801 and (select count(job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) > 0 order by spread desc;
Of course the query "exclusive screenshots per job" is the one of interest for this ticket. It runs reasonably fast on my local database. However, on OSD it took so long to execute it that I had to abort it as we have tons of jobs and screenshots there. I suppose the query can still be written in a more optimal way but I wouldn't expect a miracle.
By the way, the distinct keywords in these queries are required because the screenshot_links table contains a LOT of duplicates. I'm wondering why we don't have a unique constraint for the pair of screenshot_id and job_id. Even in my local database I see the same job-to-screenshot mapping over 200 times. That's certainly something we might want to improve although it is of course out-of-scope for this ticket.
Updated by mkittler almost 4 years ago
The query for exclusive screenshots per job can be easily improved. The following query returns in ~31 ms on OSD which is acceptable:
select distinct screenshot_id from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 5147889 and not exists(select job_id as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 5147889 limit 1);
Without the distinct it goes even down to ~18 ms so if we can cope with duplicates later we could consider avoiding it here. The 2nd run was just faster. The explicit limit 1 can also be omitted because PostgreSQL seems to be smart enough.
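For anyone who wants to try the "not exists" variant of the exclusive-screenshots query without a full openQA database, here is a small self-contained sketch. It uses Python with an in-memory SQLite database and a toy schema instead of PostgreSQL, so the timings above do not carry over; the helper name `exclusive_screenshots`, the reduced table definitions and the sample rows are assumptions made for illustration.

```python
import sqlite3


def exclusive_screenshots(conn: sqlite3.Connection, job_id: int) -> list:
    """Return the ids of screenshots that are linked to job_id and to no other job."""
    rows = conn.execute(
        """
        select distinct links.screenshot_id
        from screenshots
        join screenshot_links as links on screenshots.id = links.screenshot_id
        where links.job_id = ?
          and not exists (
            select 1 from screenshot_links as other
            where other.screenshot_id = screenshots.id and other.job_id != ?
          )
        order by links.screenshot_id
        """,
        (job_id, job_id),
    )
    return [row[0] for row in rows]


# Toy data: screenshot 10 is linked twice to job 1 (a duplicate link),
# screenshot 11 is shared by jobs 1 and 2, screenshot 12 belongs to job 2 only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table screenshots (id integer primary key);
    create table screenshot_links (screenshot_id integer, job_id integer);
    insert into screenshots (id) values (10), (11), (12);
    insert into screenshot_links (screenshot_id, job_id) values
      (10, 1), (10, 1), (11, 1), (11, 2), (12, 2);
    """
)
print(exclusive_screenshots(conn, 1))
```

The duplicate link for screenshot 10 shows why the distinct is needed as long as there is no unique constraint on the pair of screenshot_id and job_id.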
Updated by livdywan almost 4 years ago
- Due date changed from 2020-12-24 to 2021-01-08
- Status changed from Workable to Feedback
I suppose this is still being researched, hence setting to Feedback. Also bumping the due date to account for holidays.
Updated by livdywan almost 4 years ago
Updated by okurz almost 4 years ago
In today's meeting we discussed a couple of things. One of the last points we mentioned is that we could just add the df dependency and trigger the cleanup as soon as df reports not enough space, without changing the cleanup implementation. In other words: Whenever new jobs are triggered or would be triggered, call df, compare against the configured limit and, if there is not enough free space, trigger the cleanup instead of waiting for the next periodic, e.g. "nightly", cleanup job. Please split that into a subtask and turn this ticket into an epic.
Please for now work under the assumption that calling df is cheap and precise enough.
Updated by okurz almost 4 years ago
- Subject changed from Automatically remove assets+results based on available free space to [epic] Automatically remove assets+results based on available free space
created subtask #88121
Updated by mkittler almost 4 years ago
More points from the discussion:
- In the end the "df computation" should be exchangeable with a custom script to return the free percentage to cope with more complicated setups and file systems.
- There could be a dry-run which would run only the video deletion steps (which don't rely on calling df after each deleted job). That would be useful for testing.
- The UI should make it clear that the storage durations are not guaranteed.
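The idea of making the "df computation" exchangeable with a custom script could look roughly like this. It is a sketch in Python rather than openQA's Perl; the function `free_percentage` and the convention that the custom command receives the path as its last argument and prints a single number are assumptions, not an existing openQA interface.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def free_percentage(path: str, custom_command: Optional[Sequence[str]] = None) -> float:
    """Return the free space of the filesystem containing path, in percent.

    If custom_command is given, it is run with path appended as the last
    argument and is expected to print the free percentage on stdout
    (a hypothetical convention for more complicated setups). Otherwise
    the value is computed via shutil.disk_usage.
    """
    if custom_command:
        result = subprocess.run(
            list(custom_command) + [path],
            check=True,
            capture_output=True,
            text=True,
        )
        return float(result.stdout.strip())
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100
```

Keeping both variants behind one function means the cleanup only ever sees a percentage, regardless of where it comes from.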
Updated by mkittler almost 4 years ago
The PR https://github.com/os-autoinst/openQA/pull/3635 has been merged. I had to remove usages of df during the cleanup. That means it would now be actually easy to provide a dry-run. I think it is worth implementing a dry-run feature so we can enable it in production with more confidence that it won't delete too much. So that would be my next step.
Updated by mkittler almost 4 years ago
The dry run is still not that easy after all because the screenshot deletion would need to take into account which jobs would have been deleted so far. Maybe I could use a database transaction for that.
I've also noticed that there's one bug I need to fix: So far the size of symlinks (or better their targets) is taken into account but that shouldn't be the case here.
Updated by mkittler almost 4 years ago
So far the size of symlinks (or better their targets) is taken into account but that shouldn't be the case here.
A fix for that has already been merged: https://github.com/os-autoinst/openQA/pull/3705
As already mentioned, the dry-run would be more work to implement than I thought. It looks like I'd need to introduce quite some dry-run specific code which would defeat the point of having the dry-run in the first place. So I won't create a PR for that after all. Maybe some people in the team would like to help test the feature by enabling results_min_free_disk_space_percentage within [misc_limits] locally? It would make sense to check whether df returns something that makes sense, e.g. check whether the output of script/openqa eval -V 'use Filesys::Df; Filesys::Df::df(OpenQA::Utils::resultdir, 1)' makes sense.
Updated by okurz almost 4 years ago
mkittler wrote:
It would make sense to check whether df returns something that makes sense, e.g. check whether the output of script/openqa eval -V 'use Filesys::Df; Filesys::Df::df(OpenQA::Utils::resultdir, 1)' makes sense.
okurz@ariel:~> sudo -u geekotest /usr/share/openqa/script/openqa eval -V 'use Filesys::Df; Filesys::Df:: df(OpenQA::Utils::resultdir, 1)'
{
"bavail" => '2946475061248',
"bfree" => '2946475061248',
"blocks" => '5495946461184',
"favail" => 2029143627,
"ffree" => 2029143627,
"files" => 2147483200,
"fper" => 6,
"fused" => 118339573,
"per" => 46,
"su_bavail" => '2946475061248',
"su_blocks" => '5495946461184',
"su_favail" => 2029143627,
"su_files" => 2147483200,
"used" => '2549471399936',
"user_bavail" => '2946475061248',
"user_blocks" => '5495946461184',
"user_favail" => 2029143627,
"user_files" => 2147483200,
"user_fused" => 118339573,
"user_used" => '2549471399936'
}
okurz@ariel:~> df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/vdb1 5.0T 2.4T 2.7T 47% /space
/dev/mapper/vg0-assets 3.0T 1.8T 1.3T 57% /assets
...
okurz@ariel:~> df
Filesystem 1K-blocks Used Available Use% Mounted on
...
/dev/vdb1 5367135216 2489716592 2877418624 47% /space
/dev/vdc 104847360 25452500 79394860 25% /var/lib/pgsql
/dev/mapper/vg0-assets 3219652608 1829882016 1389770592 57% /assets
...
okurz@ariel:~> echo $((2877418624*1024))
2946476670976
so, ... yes?
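The comparison can be made explicit with a bit of arithmetic on the numbers pasted above. This is just a sketch in Python using the values from the two outputs; the variable names are made up.

```python
# Values reported by Filesys::Df (in bytes, since the block size argument was 1).
bavail_bytes = 2946475061248
blocks_bytes = 5495946461184
used_bytes = 2549471399936

# Values reported by plain "df" for /dev/vdb1 (in 1K blocks).
df_total_kib = 5367135216
df_available_kib = 2877418624

df_available_bytes = df_available_kib * 1024
difference_bytes = df_available_bytes - bavail_bytes
used_percent = used_bytes / blocks_bytes * 100

print("total size matches:", df_total_kib * 1024 == blocks_bytes)
print("difference in available space (KiB):", difference_bytes // 1024)
print("used percentage: %.2f" % used_percent)
```

The total size matches exactly and the available space differs by only 1572 KiB, which is plausible given that the two commands were not run at the same instant. The used percentage comes out between 46 and 47, so "per" => 46 versus 47% in the df output would be consistent with the two tools rounding in different directions; that last part is an assumption and not something verified here.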
Updated by mkittler almost 4 years ago
I guess it makes sense. Note that /space/snapshot-changes/opensuse is on the same partition. Not sure what it is used for but it would of course be problematic if it could possibly fill the entire disk space and might need its own cleanup.
Updated by okurz almost 4 years ago
mkittler wrote:
I guess it makes sense. Note that /space/snapshot-changes/opensuse is on the same partition. Not sure what it is used for but it would of course be problematic if it could possibly fill the entire disk space and might need its own cleanup.
True but please consider that out-of-scope. You don't need to care about that, i.e. if df reports below configured threshold, delete results, regardless what service filled up the space.
Updated by livdywan almost 4 years ago
Does the above confirmation mean this can be considered done?
Updated by okurz almost 4 years ago
you mean if the epic can be resolved? No, we are not there yet.
Updated by livdywan almost 4 years ago
okurz wrote:
you mean if the epic can be resolved? No, we are not there yet.
The comments and ACs suggest it's done. Maybe a good idea to reflect here what's still missing.
I would suggest to keep discussions about fixes to subtasks if you're not trying to resolve the epic.
Updated by okurz almost 4 years ago
cdywan wrote:
okurz wrote:
you mean if the epic can be resolved? No, we are not there yet.
The comments and ACs suggest it's done. Maybe a good idea to reflect here what's still missing.
I would suggest to keep discussions about fixes to subtasks if you're not trying to resolve the epic.
Maybe you trust in our comments too much. But where do you read that we have the ACs covered? As long as there is no proof that we prevent storage fillup by deleting results based on a configured threshold of space to keep free, the epic is not complete. And we were merely discussing implementation ideas and what "df" reports.
Updated by livdywan almost 4 years ago
okurz wrote:
cdywan wrote:
okurz wrote:
you mean if the epic can be resolved? No, we are not there yet.
The comments and ACs suggest it's done. Maybe a good idea to reflect here what's still missing.
I would suggest to keep discussions about fixes to subtasks if you're not trying to resolve the epic.
Maybe you trust in our comments too much. But where do you read that we have the ACs covered? As long as there is no proof that we prevent storage fillup by deleting results based on a configured threshold of space to keep free, the epic is not complete. And we were merely discussing implementation ideas and what "df" reports.
Because we have code that "ensures that we have a configured amount of free space" and it reads to me like you're discussing the existing implementation. Hence, what additional steps are we planning here? Do we want a subticket about overriding df? Or defining the clean-up schedule? Or something else?
Updated by okurz almost 4 years ago
cdywan wrote:
Because we have code that "ensures that we have a configured amount of free space"
Well, we need a proof. And for that we need that feature enabled on machines
Do we want a subticket about overriding df?
I don't see what that would bring
Or defining the clean-up schedule?
Maybe, don't know what you mean
Or something else?
Well, in the end we want to have that enabled on both osd+o3. That should all be part of the epic.
Updated by mkittler almost 4 years ago
- Assignee deleted (mkittler)
The whole epic is not what I've signed up for.
Updated by okurz almost 4 years ago
- Status changed from Feedback to Blocked
- Assignee set to okurz
Updated by okurz over 3 years ago
- Tracker changed from action to coordination
- Status changed from Blocked to New
- Assignee deleted (okurz)
subtask is resolved. Further specific actions should be discussed to be able to continue here.
Updated by okurz over 3 years ago
- Related to action #91782: Add support for archived jobs added
Updated by okurz over 3 years ago
- Status changed from New to Blocked
- Assignee set to okurz
waiting for #91782 which we see as related
We also have the feature of keeping a minimum amount of space but have not enabled it in production yet
Updated by okurz over 3 years ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
Updated by okurz over 1 year ago
okurz wrote:
waiting for #91782 which we see as related
We also have the feature of keeping a minimum amount of space but have not enabled it in production yet
We were hit by #129244, so we are wondering whether we actually have that feature or just have not enabled it properly for OSD. That feature was added with https://github.com/os-autoinst/openQA/pull/3635
Created two new tickets with those ideas: