action #64574

coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

Keep track of disk usage of results by job groups

Added by mkittler over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2020-03-18
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Problem and use case

The disk space used by job results on OSD is increasing despite the nightly cleanup. It is not clear how the different job groups contribute to that growth. As admin of that instance, it would be helpful to know this so that, e.g., cleanup quotas can be adjusted accordingly.

Notes

For assets, the cleanup quota is defined by a max. total file size for each group, and the size of each asset is tracked. For results/logs there is just a time limit for each group, and we don't keep track of the occupied disk space at all.

Suggestions

  1. Keep track of the disk space occupied by results/logs.
    • The PR https://github.com/os-autoinst/openQA/pull/2845 implements that on job level.
      • Statistics for whole groups can be easily accumulated.
      • Shared screenshots are counted in full for each job that uses them. Not sure whether that is sufficient.
  2. Show the data gathered via 1. for each job group in a graph on https://stats.openqa-monitor.qa.suse.de.
  3. Extend the result cleanup to allow a file-size-based quota if it turns out that the time-based quota is not useful even with 1. and 2.
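
As a rough illustration of how job-level figures could be accumulated per group (suggestion 1), here is a minimal Python sketch. The function name and the `(group_id, result_size)` tuple layout are hypothetical and only mirror the column names used in the SQL queries in comment #2; this is not how openQA implements it.

```python
from collections import defaultdict


def accumulate_group_sizes(jobs):
    """Sum per-job result sizes into per-group totals.

    `jobs` is an iterable of (group_id, result_size) tuples (hypothetical
    layout). Jobs without a recorded size (None) are skipped.
    """
    totals = defaultdict(int)
    for group_id, result_size in jobs:
        if result_size is not None:
            totals[group_id] += result_size
    return dict(totals)


# Toy data: two groups, one job without a recorded size.
jobs = [(1, 100), (1, 200), (2, 50), (2, None)]
print(accumulate_group_sizes(jobs))
```

Because shared screenshots are counted for each job using them, a group total built this way can exceed the space the group really occupies on disk.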

Related issues

Related to openQA Project - action #64809: Worker uploads some text results possibly multiple times wasting resources (Resolved, 2020-03-25)

Related to openQA Infrastructure - action #64824: osd /results is at 99%, about to exceed available space (Resolved, 2020-03-25)

History

#1 Updated by okurz over 2 years ago

  • Subject changed from Keep track of disk space used by results of job groups to Keep track of disk usage of results by job groups
  • Description updated (diff)
  • Category set to Feature requests

#2 Updated by mkittler over 2 years ago

My last comment was lost, so here are just the SQL queries I mentioned:

select group_id, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, sum(result_size) as result_size from jobs group by group_id order by group_id;
select group_id, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, (sum(result_size) / 1024 / 1024 / 1024) as result_size_gb from jobs group by group_id order by result_size_gb desc;
select id, test, (select concat_ws('/', (select name from job_group_parents where id = parent_id), name) from job_groups where id = group_id) as group_name, result_size, (result_size / 1024 / 1024) as  result_size_mb from jobs where result_size is not null order by result_size desc limit 20;

I'm currently experimenting with using Telegraf locally and have also drafted a MR for our monitoring: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/287


It seems the actual disk usage and the disk usage now available in the DB are not exactly the same. I guess one problem with my accounting is that it counts a screenshot multiple times if the same screenshot occurs more than once in the same test, and that's apparently often the case, e.g.

lrwxrwxrwx 1 martchus users       70 24. Mär 10:07 welcome-1.png -> /hdd/openqa-devel/openqa/images/cbb/22d/ea6b09799ac843a799e2f8578e.png
lrwxrwxrwx 1 martchus users       70 24. Mär 10:07 welcome-2.png -> /hdd/openqa-devel/openqa/images/cbb/22d/ea6b09799ac843a799e2f8578e.png
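
To make the double counting concrete, here is a small Python sketch that sums the size of a results directory either per link or per unique link target. The function name `result_dir_size` is made up for this illustration; it is not part of openQA and only assumes a layout like the one above, where result files are symlinks into a shared image store.

```python
import os


def result_dir_size(path, dedupe=True):
    """Sum the sizes of the files below `path`, resolving symlinks.

    With dedupe=True each resolved target is counted once, no matter how
    many links point at it. With dedupe=False every link is counted.
    """
    seen = set()
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            real = os.path.realpath(os.path.join(root, name))
            if not os.path.isfile(real):
                continue  # skip broken links
            if dedupe:
                if real in seen:
                    continue
                seen.add(real)
            total += os.path.getsize(real)
    return total
```

For the two links shown above, `dedupe=False` would count the PNG twice, while `dedupe=True` would count it once.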

#3 Updated by mkittler over 2 years ago

  • Related to action #64809: Worker uploads some text results possibly multiple times wasting resources added

#4 Updated by mkittler over 2 years ago

The figures stored in the database are inaccurate mainly because the accounting tracks what is uploaded rather than what is actually stored/linked. That means:

  1. Screenshots which are already present from a previous test run are not accounted for. That means the result size in the DB is smaller.
  2. Text results might be uploaded twice (but are only stored once). That means the result size in the DB is bigger. With https://github.com/os-autoinst/openQA/pull/2879 in place, this shouldn't be the case anymore.

It seems that for the jobs I've tested 2. outweighs 1. and the result size in the DB is bigger than what du reports. Nevertheless I suppose the figures are accurate enough to compare them and to identify the disk space eating culprit. However, we should avoid printing them somewhere in the web UI implying that these are exact sizes.
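
The two effects can be illustrated with a toy model in Python. Everything here is hypothetical (function name, data layout, numbers); it only mimics "count what is uploaded" versus "count what ends up on disk" and is not openQA code.

```python
def accounted_vs_stored(uploads, existing):
    """Compare upload-based accounting with what is actually on disk.

    `uploads` is a list of (name, size) tuples in upload order; a name may
    appear twice if the file was uploaded twice. `existing` maps names of
    already present screenshots (linked, never uploaded) to their size.
    """
    # Upload-based accounting: every upload counts, duplicates included.
    accounted = sum(size for _name, size in uploads)
    # On disk: each uploaded file exists once, plus the linked screenshots.
    stored_files = {name: size for name, size in uploads}
    stored = sum(stored_files.values()) + sum(existing.values())
    return accounted, stored


# One text result uploaded twice, one new and one pre-existing screenshot.
uploads = [("log.txt", 40), ("log.txt", 40), ("new.png", 100)]
existing = {"old.png": 30}
print(accounted_vs_stored(uploads, existing))
```

In this made-up example the duplicate upload (effect 2) adds more than the unaccounted screenshot (effect 1) removes, so the accounted figure ends up larger than the stored one, matching the observation above.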

Yet another caveat for people who come from the Grafana dashboard: The accumulated result size for a group only covers results recorded since we started tracking result sizes; it is not the total result size for that group.

#5 Updated by okurz about 2 years ago

  • Related to action #64824: osd /results is at 99%, about to exceed available space added

#6 Updated by mkittler about 2 years ago

#7 Updated by okurz about 2 years ago

Since you managed to grant the permissions manually, please see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/292#note_206089 for why salt failed to do the same.

#8 Updated by mkittler about 2 years ago

I've updated my PR to include a fix for salt (which will hopefully work).

#9 Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

#10 Updated by okurz about 2 years ago

  • Parent task set to #64746

#11 Updated by mkittler about 2 years ago

  • Status changed from Feedback to In Progress

When deleting logs we should keep track of the freed disk space, otherwise the effect of the cleanup is not visible in the graph until the entire job is deleted. PR: https://github.com/os-autoinst/openQA/pull/2893
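
A minimal Python sketch of the idea, assuming a job is represented as a plain dict with a `result_size` entry (a hypothetical stand-in for the DB column). The linked PR is the authoritative implementation; this only shows the bookkeeping.

```python
import os


def delete_logs(job, log_paths):
    """Delete the given log files and subtract the freed bytes from the job.

    `job` is a dict with an optional integer "result_size" (hypothetical
    representation). Returns the number of bytes freed.
    """
    freed = 0
    for path in log_paths:
        try:
            size = os.path.getsize(path)
            os.remove(path)
        except OSError:
            continue  # file already gone or not removable; nothing freed
        freed += size
    if job.get("result_size") is not None:
        # Clamp at zero because the recorded size is only approximate.
        job["result_size"] = max(0, job["result_size"] - freed)
    return freed
```

With the size reduced at log deletion time, a per-group graph would reflect the cleanup immediately instead of only once the whole job is removed.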

#12 Updated by mkittler about 2 years ago

The mentioned PR has been merged.


The retention policy for the data might need to be adjusted. It would also make sense to perform the PostgreSQL query less frequently.

#13 Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

PR to query less frequently (already merged): https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/8401cd98bd4a545ee77020992ccdbe5b6f4893cf

It seems the retention policy can only be set at the database level. So it is probably best to configure a global retention policy at some point (not as part of this task).

#14 Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved
  • Target version deleted (Current Sprint)

I've created a follow-up ticket for the retention policy: #66019

Not sure what's left to do so I'm closing the ticket as resolved.
