coordination #96974

openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

openQA Project - coordination #80546: [epic] Scale up: Enable to store more results

[epic] Improve/reconsider thresholds for skipping cleanup

Added by mkittler 5 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2021-09-20
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

We've been introducing thresholds to skip the asset/result cleanup if there's enough free disk space anyways. The idea is to reduce the number of cleanup runs which always come with a certain overhead, even if the cleanup wouldn't do anything.
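The skip-if-enough-free-space idea can be sketched language-neutrally (openQA itself is written in Perl; the function name and threshold semantics below are hypothetical, a minimal illustration only):

```python
import shutil

def should_skip_cleanup(path, min_free_percent):
    """Return True when the file system at `path` still has more than
    `min_free_percent` percent free space, i.e. cleanup can be skipped."""
    usage = shutil.disk_usage(path)
    free_percent = usage.free / usage.total * 100
    return free_percent > min_free_percent
```

A scheduler would call this before enqueuing a cleanup job and skip the run entirely when it returns True, avoiding the fixed overhead of a cleanup that would delete nothing.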

The feature itself works well but leads to problems in production (on OSD). In particular, last week the file system alert triggered (see #96789#note-3). Nothing was actually broken; the asset cleanup had just been postponed for too long: first it stayed under the threshold, and then it was blocked by the result cleanup, which took very long because it had itself been postponed and therefore had a lot to clean up.

I've been asking myself the following questions on how to improve this in the future:

  1. Maybe we could change the locking to allow running the asset and result cleanups concurrently? In our setup results and assets are on different disks, so running both at the same time shouldn't be counterproductive, and in this case it would have helped. In fact, I resolved the mentioned alert by manually deleting the limit_tasks lock to let the asset cleanup run in parallel with the result cleanup.
  2. The last 3 asset cleanup jobs which could actually have run did not, because at that point the threshold had not been reached and the cleanup was therefore skipped. The same applies to the result cleanup which ran before the currently active one: it was skipped because we were under the threshold, which likely contributes to the fact that the first cleanup that actually runs again takes very long. Maybe we should rethink postponing the cleanup according to the thresholds; at least the current thresholds cause the cleanup to be triggered too late.
  3. Thresholds for skipping the cleanup would make more sense if they were also considered for aborting the cleanup early. I suppose that would have helped in this case, e.g. the result cleanup would have aborted earlier so the asset cleanup could have run.
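The per-type locking idea from question 1 can be sketched as follows (a minimal Python illustration, assuming one lock per cleanup type instead of a single shared lock like limit_tasks; all names here are hypothetical, not openQA's actual implementation):

```python
from threading import Lock

# One lock per cleanup type, so asset and result cleanups
# only block other runs of the *same* type, not each other.
LOCKS = {"assets": Lock(), "results": Lock()}

def run_cleanup(kind, task):
    """Run `task` under the lock for `kind`; return False if a cleanup
    of the same kind is already in progress."""
    lock = LOCKS[kind]
    if not lock.acquire(blocking=False):
        return False  # same-kind cleanup already running, give up
    try:
        task()
        return True
    finally:
        lock.release()
```

With this scheme, a long-running result cleanup no longer prevents the asset cleanup from starting, which is exactly the situation that triggered the alert described above.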

Subtasks

action #98922: Run asset cleanup concurrently to results based on config (Resolved, mkittler)

action #103954: Run asset cleanup concurrently to results based on config on o3 as well (Resolved, okurz)


Related issues

Related to openQA Project - coordination #64881: [epic] Reconsider triggering cleanup jobs (New, 2021-08-31)

History

#1 Updated by tinita 5 months ago

  • Target version set to future

#2 Updated by mkittler 5 months ago

#3 Updated by okurz 4 months ago

  • Tracker changed from action to coordination
  • Subject changed from Improve/reconsider thresholds for skipping cleanup to [epic] Improve/reconsider thresholds for skipping cleanup
  • Target version changed from future to Ready
  • Parent task set to #80546

We are again hitting problems, right now on osd where asset space was depleted. Adding to backlog as epic.

I suggest starting with the "run asset cleanup concurrently to results based on config" subtask.

#4 Updated by okurz 4 months ago

Less critical with #99246 fixed

#5 Updated by okurz 4 months ago

  • Target version changed from Ready to future

#6 Updated by okurz about 1 month ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

#7 Updated by okurz about 1 month ago

With additional improvements in related areas, and as soon as the concurrent cleanup works on o3, we can call this resolved.

#8 Updated by okurz about 1 month ago

  • Status changed from Blocked to Resolved
