coordination #96974

openQA Project - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

openQA Project - coordination #80546: [epic] Scale up: Enable to store more results

[epic] Improve/reconsider thresholds for skipping cleanup

Added by mkittler over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: -
Target version: Ready
Start date: 2021-09-20
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

We've been introducing thresholds to skip the asset/result cleanup if there's enough free disk space anyway. The idea is to reduce the number of cleanup runs, which always come with a certain overhead, even if the cleanup wouldn't actually delete anything.
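
A minimal sketch of that skip logic, assuming nothing about the actual openQA implementation (the function name, parameter names and threshold value below are made up): before a cleanup run does any work, the free space on the relevant filesystem is compared against a threshold and the run is skipped while enough space is left.

    import shutil

    # Hypothetical threshold: skip the cleanup while at least this fraction of
    # the filesystem is still free (the real openQA setting and default differ).
    FREE_SPACE_THRESHOLD = 0.20

    def cleanup_needed(path: str, threshold: float = FREE_SPACE_THRESHOLD) -> bool:
        """Return True if the filesystem holding `path` has less free space
        than the threshold, i.e. the cleanup should actually run."""
        usage = shutil.disk_usage(path)
        return usage.free / usage.total < threshold

    # The cleanup task would then bail out before doing any work:
    # if not cleanup_needed("/var/lib/openqa/share/factory"):
    #     return  # skip this run and save the overhead of a full cleanup pass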

The feature itself works well but leads to problems in production (on OSD). In particular, last week the file system alert triggered (see #96789#note-3). Nothing was really broken; the asset cleanup had just been postponed for too long. First it was skipped because the free-space threshold hadn't been reached, and then it was blocked by the result cleanup, which took very long because it had itself been postponed and therefore had a lot to clean up.

I've been asking myself the following questions on how to improve this in the future:

  1. Maybe we could change the locking to allow running the cleanup of assets and results concurrently (see the sketch after this list)? In our setup results and assets are on different disks, so running both at the same time shouldn't be counterproductive, and in this case it would have helped. In fact, I resolved the mentioned alert by manually deleting the limit_tasks lock to let the asset cleanup run in parallel with the result cleanup.
  2. The last 3 asset cleanup jobs which could actually have run did not, because at that point the threshold hadn't been reached and therefore the cleanup was skipped. The same goes for the result cleanup which ran before the currently active one. It was skipped because we were under the threshold, but that likely contributes to the fact that the first cleanup which actually runs again takes very long. Maybe we should rethink postponing the cleanup according to the thresholds; at least the current thresholds cause the cleanup to be triggered too late.
  3. Having thresholds for skipping the cleanup would make more sense if they were also considered for aborting the cleanup early. I suppose in this case that would have helped, e.g. the result cleanup would have aborted earlier so the asset cleanup could have run.

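A minimal sketch of the per-task locking idea from point 1, not openQA's actual Minion lock API (lock names and file paths below are made up): instead of one shared lock that serializes all cleanup tasks, each cleanup type takes its own lock, so asset and result cleanup can run concurrently while two runs of the same type still exclude each other.

    import fcntl
    import os
    from contextlib import contextmanager

    # Hypothetical per-task lock files; in the incident above a single shared
    # "limit_tasks" lock serialized the asset and result cleanup instead.
    LOCKS = {
        "asset_cleanup": "/tmp/openqa-asset-cleanup.lock",
        "result_cleanup": "/tmp/openqa-result-cleanup.lock",
    }

    @contextmanager
    def task_lock(task: str):
        """Hold an exclusive lock for one cleanup type only, so asset and
        result cleanup no longer block each other while two runs of the same
        type still cannot overlap."""
        fd = os.open(LOCKS[task], os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast if this cleanup type is already running
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)

    # with task_lock("asset_cleanup"):
    #     run_asset_cleanup()  # hypothetical; may now run while the result cleanup holds its own lock
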
Subtasks 2 (0 open, 2 closed)

action #98922: Run asset cleanup concurrently to results based on config (Resolved, mkittler, 2021-09-20)

action #103954: Run asset cleanup concurrently to results based on config on o3 as well (Resolved, okurz)


Related issues 1 (1 open, 0 closed)

Related to openQA Project - coordination #64881: [epic] Reconsider triggering cleanup jobs (New, 2021-08-31)

Actions #1

Updated by tinita over 2 years ago

  • Target version set to future
Actions #2

Updated by mkittler over 2 years ago

Actions #3

Updated by okurz over 2 years ago

  • Tracker changed from action to coordination
  • Subject changed from Improve/reconsider thresholds for skipping cleanup to [epic] Improve/reconsider thresholds for skipping cleanup
  • Target version changed from future to Ready
  • Parent task set to #80546

We are again hitting problems, right now on OSD where asset space was depleted. Adding to backlog as epic.

I suggest starting with "run asset cleanup concurrently to results based on config".

Actions #4

Updated by okurz over 2 years ago

Less critical with #99246 fixed

Actions #5

Updated by okurz over 2 years ago

  • Target version changed from Ready to future
Actions #6

Updated by okurz over 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready
Actions #7

Updated by okurz over 2 years ago

With additional improvements in related areas, and as soon as the concurrent cleanup works on o3 as well, we can call this resolved.

Actions #8

Updated by okurz over 2 years ago

  • Status changed from Blocked to Resolved