coordination #96974
openQA Project (public) - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
openQA Project (public) - coordination #80546: [epic] Scale up: Enable to store more results
[epic] Improve/reconsider thresholds for skipping cleanup
Description
We've been introducing thresholds to skip the asset/result cleanup if there's enough free disk space anyway. The idea is to reduce the number of cleanup runs, which always come with a certain overhead even if the cleanup wouldn't do anything.
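To make the idea concrete, here is a minimal Python sketch of such a threshold check; the percentage and the asset path are made-up placeholders, not the actual openQA settings:

```python
import shutil

# Hypothetical threshold: only run the cleanup if free space drops below this percentage.
FREE_PERCENTAGE_THRESHOLD = 20


def cleanup_needed(path, threshold=FREE_PERCENTAGE_THRESHOLD):
    """Return True if the filesystem containing `path` has less free space than the threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100 < threshold


if cleanup_needed("/var/lib/openqa/share/factory"):
    print("running asset cleanup")
else:
    print("enough free space, skipping asset cleanup")  # this is the overhead we want to save
```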
The feature itself works well but leads to problems in production (on OSD). In particular, last week the file system alert triggered (see #96789#note-3). Nothing was really broken; the asset cleanup was just postponed for too long: first it was skipped because we were under the threshold, and then it was blocked by the result cleanup, which took very long because the result cleanup itself had also been postponed and therefore had a lot to clean up.
I've been asking myself the following questions on how to improve this in the future:
- Maybe we could change the locking to allow running the cleanup of assets and results concurrently (see the sketch after this list)? In our setup results and assets are on different disks, so running both at the same time shouldn't be counterproductive, and in this case it would have helped. In fact, I resolved the mentioned alert by manually deleting the limit_tasks lock to let the asset cleanup run in parallel with the result cleanup.
- The last 3 asset cleanup jobs which could actually have run did not because at that point the threshold had not been reached and the cleanup was therefore skipped. The same applies to the result cleanup which ran before the currently active one. It was skipped because we were under the threshold, but that is likely contributing to the fact that the first cleanup which actually runs again takes very long. Maybe we should rethink postponing the cleanup according to the thresholds. At least the current thresholds cause the cleanup to be triggered too late.
- Having thresholds for skipping the cleanup would make more sense if the threshold were also considered for aborting the cleanup early. I suppose that would have helped in this case, e.g. the result cleanup would have aborted earlier so the asset cleanup could have run.
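For illustration, here is a rough Python sketch of what per-resource locks combined with an early-abort check could look like; the lock directory, lock names and helper callbacks are hypothetical and not taken from the actual openQA code:

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical lock directory; the idea is one lock file per resource instead of a
# single shared "limit_tasks" lock, so asset and result cleanup can run concurrently.
LOCK_DIR = "/tmp/openqa-cleanup-locks"


@contextmanager
def cleanup_lock(name):
    """Hold an exclusive lock for one cleanup resource (e.g. 'assets' or 'results')."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    with open(os.path.join(LOCK_DIR, f"{name}.lock"), "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks only other cleanups of the same resource
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def cleanup_results(batches, enough_free_space):
    """Delete results in batches, aborting early once the threshold is satisfied again."""
    with cleanup_lock("results"):
        for delete_batch in batches:
            if enough_free_space():  # early abort: re-check the threshold between batches
                break
            delete_batch()


def cleanup_assets(run_asset_cleanup):
    # Uses its own lock, so it is not blocked by a long-running result cleanup.
    with cleanup_lock("assets"):
        run_asset_cleanup()
```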
Updated by mkittler over 3 years ago
- Related to coordination #64881: [epic] Reconsider triggering cleanup jobs added
Updated by okurz about 3 years ago
- Tracker changed from action to coordination
- Subject changed from Improve/reconsider thresholds for skipping cleanup to [epic] Improve/reconsider thresholds for skipping cleanup
- Target version changed from future to Ready
- Parent task set to #80546
We are again hitting problems, right now on OSD where asset space was depleted. Adding this to the backlog as an epic.
I suggest starting with "run asset cleanup concurrently with results based on config".
Updated by okurz almost 3 years ago
- Status changed from New to Blocked
- Assignee set to okurz
- Target version changed from future to Ready
Updated by okurz almost 3 years ago
With additional improvements in related areas, and as soon as the concurrent cleanup on o3 works, we can call this resolved.