action #64574
coordination #64746 (closed): [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
Keep track of disk usage of results by job groups
0%
Description
Problem and use case
The disk space used by job results on OSD is increasing despite the nightly cleanup. It is not clear how the different job groups contribute to that growth. For the admin of that instance it would be helpful to know this so that, e.g., cleanup quotas can be adjusted accordingly.
Notes
For assets the cleanup quota is defined by a max. total file size for each group and the size of each asset is tracked. For results/logs there is just a time limit for each group and we don't keep track of the occupied disk space at all.
Suggestions
1. Keep track of the disk space occupied by results/logs.
   - The PR https://github.com/os-autoinst/openQA/pull/2845 implements that on job level.
   - Statistics for whole groups can be easily accumulated.
   - Shared screenshots are accounted to each job using them individually. Not sure whether that is sufficient.
2. Show the data gathered via 1. for each job group in a graph on https://stats.openqa-monitor.qa.suse.de.
3. Extend the result cleanup to allow a file-size-based quota if it turns out that the time-based quota is not useful even with 1. and 2.
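To illustrate the last suggestion, here is a minimal sketch of how a size-based quota could select results for cleanup: the oldest jobs of a group are picked until the group's total fits the quota. This is not openQA code; the `Job` class, its `finished_at` field and the function name `jobs_to_clean` are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: int
    finished_at: int  # hypothetical timestamp in seconds; smaller means older
    result_size: int  # bytes, as tracked per job


def jobs_to_clean(jobs, quota_bytes):
    """Return the jobs whose results would be removed, oldest first,
    until the group's total result size fits within quota_bytes."""
    total = sum(job.result_size for job in jobs)
    selected = []
    for job in sorted(jobs, key=lambda j: j.finished_at):
        if total <= quota_bytes:
            break
        selected.append(job)
        total -= job.result_size
    return selected
```

Since the tracked sizes are only estimates, such a quota would bound the disk usage approximately rather than exactly.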
Updated by okurz almost 5 years ago
- Subject changed from Keep track of disk space used by results of job groups to Keep track of disk usage of results by job groups
- Description updated (diff)
- Category set to Feature requests
Updated by mkittler almost 5 years ago
My last comment was lost; here are just the SQL queries I mentioned:
select group_id, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, sum(result_size) as result_size from jobs group by group_id order by group_id;
select group_id, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, (sum(result_size) / 1024 / 1024 / 1024) as result_size_gb from jobs group by group_id order by result_size_gb desc;
select id, test, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, result_size, (result_size / 1024 / 1024) as result_size_mb from jobs where result_size is not null order by result_size desc limit 20;
I'm currently experimenting with using Telegraf locally and have also drafted a MR for our monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/287
Seems like the actual disk usage and the disk usage now available in the DB are not exactly the same. I guess one problem with my accounting is that it counts screenshots twice if the same screenshot is present in the same test and that's apparently often the case, e.g.
lrwxrwxrwx 1 martchus users 70 24. Mär 10:07 welcome-1.png -> /hdd/openqa-devel/openqa/images/cbb/22d/ea6b09799ac843a799e2f8578e.png
lrwxrwxrwx 1 martchus users 70 24. Mär 10:07 welcome-2.png -> /hdd/openqa-devel/openqa/images/cbb/22d/ea6b09799ac843a799e2f8578e.png
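One way to avoid the double counting within a single test could look like the following sketch. It sums up the files below a result directory but counts every underlying file only once, identified by device and inode number, so that `welcome-1.png` and `welcome-2.png` above would contribute the screenshot's size a single time. The function name `result_dir_size` is hypothetical, and this is not how openQA accounts sizes. Note that it counts the size of the link targets, so the figure will still differ from a plain `du`, which counts the symlinks themselves.

```python
import os


def result_dir_size(path):
    """Sum the size of all files below path, counting every distinct
    underlying file only once, even if several links point to it."""
    seen = set()
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                st = os.stat(os.path.join(root, name))  # follows symlinks
            except OSError:
                continue  # e.g. a dangling symlink
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total
```

A screenshot shared between different jobs would still be counted once per job with this approach.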
Updated by mkittler almost 5 years ago
- Related to action #64809: Worker uploads some text results possibly multiple times wasting resources added
Updated by mkittler almost 5 years ago
The reason why the figures stored in the database are inaccurate is generally that they account for what is being uploaded and not for what is being stored/linked. That means:
1. Screenshots which are already present from a previous test run are not accounted. That means the result size in the DB is smaller.
2. Text results might be uploaded twice (but are only stored once). That means the result size in the DB is bigger. With https://github.com/os-autoinst/openQA/pull/2879 in place, this shouldn't be the case anymore.
It seems that for the jobs I've tested 2. outweighs 1. and the result size in the DB is bigger than what du reports. Nevertheless I suppose the figures are accurate enough to compare them and to identify the disk space eating culprit. However, we should avoid printing them somewhere in the web UI implying that these are exact sizes.
Yet another caveat to mention for people who come from the Grafana dashboard: The accumulated result size for the groups obviously only covers results recorded since the tracking of result sizes started and not the total result size for that group.
Updated by okurz almost 5 years ago
- Related to action #64824: osd /results is at 99%, about to exceed available space added
Updated by mkittler almost 5 years ago
- The PR for the Telegraf query has been merged.
- PR for the Grafana dashboard is ongoing: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/293
- PR for PostgreSQL permissions of Telegraf user has been merged but maybe needs amendment: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/292
Updated by okurz almost 5 years ago
As you managed to provide the permissions manually please see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/292#note_206089 why salt failed to do the same.
Updated by mkittler almost 5 years ago
I've updated my PR to include a fix for salt (which will hopefully work).
Updated by mkittler almost 5 years ago
- Status changed from In Progress to Feedback
The PR has been merged and the panel is available: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=19
Updated by mkittler almost 5 years ago
- Status changed from Feedback to In Progress
When deleting logs we should keep track of the freed disk space, otherwise the effect of the cleanup is not visible in the graph until the entire job is deleted. PR: https://github.com/os-autoinst/openQA/pull/2893
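The idea can be sketched as follows: whenever log files of a job are deleted, the freed bytes are subtracted from the size recorded for that job. This is only an illustration and not the implementation from the PR; the function `delete_logs` and the representation of a job as a dict with a `result_size` key are assumptions made for the example.

```python
import os


def delete_logs(job, log_paths):
    """Delete the given log files and subtract the freed bytes from the
    job's recorded result size. Returns the number of bytes freed."""
    freed = 0
    for path in log_paths:
        try:
            size = os.path.getsize(path)
            os.remove(path)
        except OSError:
            continue  # file already gone or not removable; nothing freed
        freed += size
    # The recorded size is only an estimate, so never let it go negative.
    job["result_size"] = max(0, (job.get("result_size") or 0) - freed)
    return freed
```

With the recorded size adjusted like this, a per-group sum over the jobs would drop as soon as logs are removed and not only once the whole job is deleted.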
Updated by mkittler almost 5 years ago
The mentioned PR has been merged.
The retention policy for the data might need to be adjusted. It would also make sense to perform the PostgreSQL query less frequently.
Updated by mkittler over 4 years ago
- Status changed from In Progress to Feedback
PR to query less frequently (already merged): https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/8401cd98bd4a545ee77020992ccdbe5b6f4893cf
It seems the retention policy can only be set at the database level. So it is likely best to configure a global retention policy at some point (not as part of this task).
Updated by mkittler over 4 years ago
- Status changed from Feedback to Resolved
- Target version deleted (Current Sprint)
I've created a follow-up ticket for the retention policy: #66019
Not sure what's left to do so I'm closing the ticket as resolved.