action #76984

coordination #64746: [saga][epic] Scale up: Handle large storage efficiently to be able to run current tests efficiently but keep big archives of old results

Automatically remove assets+results based on available free space

Added by okurz 3 months ago. Updated 8 days ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

See examples like #76822: openQA automatically removes assets and results, but the sum of all configured retention periods and asset quotas can still exceed the available space, so manual administration is required. If the cleanup based on these parameters cannot free enough space, we should take the next step and remove more data until enough space is free again. We already do something similar in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18 to remove videos of older test jobs, which we identified as a big contributor to space usage.

Acceptance criteria

  • AC1: The filesystem containing the openQA results directory is guaranteed to have at least a configured amount of free space

Suggestions

  • Read and understand https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON#L18
  • Extend the existing asset+result cleanup to
    • check the free space of the filesystem containing the assets/results directory
    • compare the free space against a configured value, e.g. in openqa.ini
    • if free space is still below the limit after the results cleanup, remove more data from results, checking the free space after each step until the limit is reached, e.g.
      • videos from oldest, non-important jobs first ("oldest first" can simply mean ascending job id order)
      • other results from oldest, non-important jobs
      • videos from oldest, important jobs
      • other results from oldest, important jobs
    • if the free space limit still cannot be reached after all steps, i.e. if all result data was removed, raise an error
    • the above steps can be made configurable as well, e.g. "results_free_space_cleanup_components=non-important-results-videos,non-important-results-other,important-results-videos,important-results-other"
  • can use https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df
  • can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"
  • Optional: Extend to assets as well
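The stepwise cleanup suggested above could look roughly like the following sketch. It is written in Python purely for illustration (openQA itself is Perl), and every name in it (cleanup_until_free, get_free, candidates, remove, FakeDisk) is hypothetical. Free space is read through an injected callable, so tests can mock "df" as proposed in the suggestions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Step names copied from the example setting in this ticket.
DEFAULT_STEPS = (
    "non-important-results-videos",
    "non-important-results-other",
    "important-results-videos",
    "important-results-other",
)


def cleanup_until_free(
    min_free: int,
    get_free: Callable[[], int],
    candidates: Dict[str, List[int]],
    remove: Callable[[str, int], None],
    steps: Sequence[str] = DEFAULT_STEPS,
) -> List[Tuple[str, int]]:
    """Remove data step by step until at least min_free bytes are free.

    get_free   -- returns the free bytes of the results filesystem (mockable "df")
    candidates -- maps each step name to the job ids it may clean up (assumption)
    remove     -- deletes the data of one job for one step (assumption)
    """
    removed = []
    for step in steps:
        # "oldest first" is approximated by ascending job id
        for job_id in sorted(candidates.get(step, [])):
            if get_free() >= min_free:
                return removed  # stop as soon as the limit is reached
            remove(step, job_id)
            removed.append((step, job_id))
    if get_free() < min_free:
        raise RuntimeError("free space limit not reached after all cleanup steps")
    return removed


class FakeDisk:
    """Test double standing in for "df" and for the actual deletion."""

    def __init__(self, free: int):
        self.free = free

    def df(self) -> int:
        return self.free

    def remove(self, step: str, job_id: int) -> None:
        self.free += 10  # pretend every removal frees 10 units


disk = FakeDisk(free=75)
removed = cleanup_until_free(
    100,
    disk.df,
    {
        "non-important-results-videos": [3, 1, 2],
        "non-important-results-other": [1, 2, 3],
    },
    disk.remove,
)
print(removed)
```

In this toy run only the three video removals are needed, so the "other results" step is never touched.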

Impact

This would also greatly help us as administrators of osd to ensure that the /results limits are not exceeded, which has repeatedly caused us additional administration work.

Workaround

Have a periodic job call "df" and check the result against the limit; remove results if the free space is below the limit
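A minimal sketch of the check such a periodic job could perform, in Python for illustration only. shutil.disk_usage is a standard library call; the function name free_space_exceeded, the path and the limit values are assumptions and not part of openQA.

```python
import shutil
from collections import namedtuple


def free_space_exceeded(path: str, min_free_bytes: int, usage=shutil.disk_usage) -> bool:
    """Return True if the filesystem containing path has less free space than required."""
    return usage(path).free < min_free_bytes


# Real filesystem: a limit of 0 bytes can never be violated.
print(free_space_exceeded(".", 0))

# Mocked "df": pretend only 5 bytes are free on a hypothetical /results mount.
FakeUsage = namedtuple("FakeUsage", "total used free")
print(free_space_exceeded("/results", 10, usage=lambda path: FakeUsage(100, 95, 5)))
```

The periodic job would trigger the removal of results only when this check returns True.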


Related issues

Related to openQA Project - action #64881: Reconsider triggering cleanup jobs (Blocked, 2020-03-26)

Related to openQA Infrastructure - action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) (Workable, 2020-06-14)

Copied from openQA Infrastructure - action #76822: Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents) (Resolved, 2020-10-30, 2020-11-13)

History

#1 Updated by okurz 3 months ago

  • Copied from action #76822: Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents) added

#2 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by okurz 3 months ago

  • Related to action #64881: Reconsider triggering cleanup jobs added

#4 Updated by okurz 3 months ago

  • Description updated (diff)

#5 Updated by okurz about 2 months ago

  • Parent task set to #64746

#6 Updated by mkittler about 1 month ago

  • Related to action #68053: powerqaworker-qam-1 fails to come up on reboot (repeatedly) added

#7 Updated by mkittler about 1 month ago

  • Assignee set to mkittler

#8 Updated by mkittler about 1 month ago

  1. Filesys::Df would be very simple to use: https://metacpan.org/pod/Filesys::Df
  2. Regarding result cleanup
    1. We would likely want to run this after the regular cleanup. That's after all limit_screenshots tasks are done. So I'd add an additional Minion task and would enqueue such a job to run after all limit_screenshots jobs via the parents argument of enqueue (https://metacpan.org/pod/Minion#enqueue1).
    2. After cleaning up a job we need to check whether we are still below the configured free space limit in order to decide whether we need to proceed and clean up more jobs. The problem is that the screenshots associated with a deleted job are not deleted immediately when the job is deleted. Taking care of dangling screenshots is so far only implemented as a separate task (the one mentioned in 2.1) which considers all screenshots. It looks like we need a 2nd way to clean up screenshots which would not work in batches over all screenshots but would only consider the screenshots related to a certain job. I'm not sure how efficient that would be, but maybe it would be ok to run it every time after cleaning up a job while the free space limit is violated. Writing a query for the screenshots exclusively used by a certain job wouldn't be very difficult, but it might be expensive to run, especially since we would possibly need to run it quite often.
  3. Regarding asset cleanup
    1. We actually already have size limits but over-allocate in practice. So I assume this ticket is about combining the possibility of an over-allocation with a cleanup that ensures we do not actually run out of disk space.
    2. The previous point still leaves open the question of which assets should be deleted first. Maybe a 2nd asset cleanup run should be performed after the regular asset cleanup.
      1. It would use a scaled-down version of the configured quotas. By scaled-down I mean that the absolute sizes of each group would be reduced to fit some limit while the proportions would be preserved. So the configured quotas would only serve as a weight factor for the 2nd cleanup.
      2. It would stop immediately once the threshold for the disk utilization is no longer exceeded, so assets are not removed needlessly.
      3. I hope that the combination of the previous points 1. and 2. allows certain groups to retain their over-allocated assets as long as enough other groups don't actually use their allocated limit.
    3. Maybe it makes sense to visualize the scaled-down limits first within openQA's asset statistics.
    4. It would also be nice to be able to perform a dry-run (with production data) before introducing changes like this.
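To illustrate the scaled-down quotas from point 3.2.1, here is a small sketch in Python. The function name scale_quotas, the group names and the numbers are made up for illustration, and leaving the quotas untouched when they already fit is an assumption.

```python
def scale_quotas(quotas: dict, limit: float) -> dict:
    """Shrink the per-group quotas so their sum fits limit, preserving proportions."""
    total = sum(quotas.values())
    if total <= limit:
        # assumption: no scaling is needed if the configured quotas already fit
        return dict(quotas)
    factor = limit / total
    return {group: size * factor for group, size in quotas.items()}


# Hypothetical groups with over-allocated quotas (e.g. in GiB) and a 200 GiB limit.
configured = {"group-a": 100, "group-b": 300}
scaled = scale_quotas(configured, 200)
print(scaled)
```

The ratio between the two groups stays 1:3 after scaling, so the configured quotas act only as weights.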

#9 Updated by openqa_review about 1 month ago

  • Due date set to 2020-12-24

Setting due date based on mean cycle time of SUSE QE Tools

#10 Updated by mkittler about 1 month ago

Here are a few queries related to the screenshots-to-job mapping in our database which can help with point 2.2 of my previous comment:

number of screenshots per job:
select job_id, count(distinct screenshot_id) as screenshot_count from screenshot_links where job_id = 1801 group by job_id;

number of jobs referencing a screenshot:
select count(distinct job_id) as screenshot_usage from screenshot_links where screenshot_id = 242820;

exclusive screenshots per job:
select distinct screenshot_id from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 1801 and (select count(job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) = 0;

shared screenshots per job:
select distinct screenshot_id, (select count(distinct job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) as spread from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 1801 and (select count(job_id) as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 1801) > 0 order by spread desc;

Of course the "exclusive screenshots per job" query is the one of interest for this ticket. It runs reasonably fast on my local database. However, on OSD it took so long that I had to abort it, as we have tons of jobs and screenshots there. I suppose the query can still be optimized, but I wouldn't expect a miracle.

By the way, the distincts in these queries are required because the screenshot_links table contains a LOT of duplicates. I'm wondering why we don't have a unique constraint for the pair of screenshot_id and job_id. Even in my local database I see the same job-to-screenshot mapping over 200 times. That's certainly something we might want to improve, although it is of course out of scope for this ticket.

#11 Updated by mkittler about 1 month ago

The query for exclusive screenshots per job can be easily improved. The following query returns in ~31 ms on OSD which is acceptable:

select distinct screenshot_id from screenshots join screenshot_links on screenshots.id=screenshot_links.screenshot_id where job_id = 5147889 and not exists(select job_id as screenshot_usage from screenshot_links where screenshot_id = id and job_id != 5147889 limit 1);

Without the distinct it goes down even further to ~18 ms, so if we can cope with duplicates later we could consider not using it here. The 2nd run was just faster. The explicit limit 1 can also be omitted because PostgreSQL seems to be smart enough.
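The following sketch compares the count-based and the "not exists" variants of the exclusive-screenshots query on a toy dataset. It uses Python with an in-memory SQLite database instead of PostgreSQL, so it only illustrates the logic and says nothing about the timings on OSD. The table contents and job ids are invented, the schema is reduced to the columns used in the queries, and table aliases were added for clarity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE screenshots (id INTEGER PRIMARY KEY);
    CREATE TABLE screenshot_links (screenshot_id INTEGER, job_id INTEGER);
    INSERT INTO screenshots (id) VALUES (1), (2), (3);
    -- screenshot 1 is linked twice to job 10 to mimic the duplicates
    INSERT INTO screenshot_links (screenshot_id, job_id)
        VALUES (1, 10), (1, 10), (2, 10), (2, 11), (3, 11);
    """
)

BASE = """
SELECT {distinct} sl.screenshot_id
FROM screenshots s
JOIN screenshot_links sl ON s.id = sl.screenshot_id
WHERE sl.job_id = :job AND {condition}
ORDER BY sl.screenshot_id
"""

# Variant from the earlier comment: count the links from other jobs.
COUNT_BASED = (
    "(SELECT count(o.job_id) FROM screenshot_links o"
    " WHERE o.screenshot_id = s.id AND o.job_id != :job) = 0"
)
# Improved variant: only test whether any link from another job exists.
NOT_EXISTS = (
    "NOT EXISTS (SELECT 1 FROM screenshot_links o"
    " WHERE o.screenshot_id = s.id AND o.job_id != :job)"
)


def exclusive_screenshots(condition: str, job_id: int, distinct: str = "DISTINCT"):
    query = BASE.format(distinct=distinct, condition=condition)
    return [row[0] for row in conn.execute(query, {"job": job_id})]


by_count = exclusive_screenshots(COUNT_BASED, 10)
by_not_exists = exclusive_screenshots(NOT_EXISTS, 10)
with_duplicates = exclusive_screenshots(NOT_EXISTS, 10, distinct="")
print(by_count, by_not_exists, with_duplicates)
```

Both variants return the same set of screenshots here. Dropping the distinct returns the duplicated link twice, which is the case later processing would have to cope with.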

#12 Updated by mkittler about 1 month ago

  • Description updated (diff)

#13 Updated by cdywan 23 days ago

  • Due date changed from 2020-12-24 to 2021-01-08
  • Status changed from Workable to Feedback

I suppose this is still being researched, hence setting to Feedback. Also bumping the due date to account for holidays.

#14 Updated by cdywan 8 days ago

  • Due date deleted (2021-01-08)
