coordination #96974: [epic] Improve/reconsider thresholds for skipping cleanup - openQA Infrastructure (public) - openSUSE Project Management Tool

coordination #96974

closed

openQA Project (public) - coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results

openQA Project (public) - coordination #80546: [epic] Scale up: Enable to store more results

[epic] Improve/reconsider thresholds for skipping cleanup

Added by mkittler over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-09-20

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

We've introduced thresholds to skip the asset/result cleanup if there's enough free disk space anyway. The idea is to reduce the number of cleanup runs, which always come with a certain overhead even if the cleanup wouldn't do anything.
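For illustration, the skip decision can be sketched roughly as follows. This is a minimal sketch in Python and not openQA's actual implementation; the names `has_enough_free_space`, `should_skip_cleanup` and `min_free_percentage` are hypothetical.

```python
import shutil


def has_enough_free_space(free_bytes: int, total_bytes: int,
                          min_free_percentage: float) -> bool:
    """Return True if the free share of the disk is at or above the threshold."""
    if total_bytes <= 0:
        # Unknown disk size: be conservative and do not skip the cleanup.
        return False
    return free_bytes / total_bytes * 100 >= min_free_percentage


def should_skip_cleanup(path: str, min_free_percentage: float) -> bool:
    """Skip the cleanup if the file system holding `path` has enough free space.

    `min_free_percentage` is a hypothetical setting name used only in this sketch.
    """
    usage = shutil.disk_usage(path)
    return has_enough_free_space(usage.free, usage.total, min_free_percentage)
```

With a threshold of 20, a disk with 25 % free space would skip the cleanup, while one with 15 % free space would run it.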

The feature itself works well but leads to problems in production (on OSD). In particular, last week the file system alert triggered (see #96789#note-3). Nothing was really broken; the asset cleanup was just postponed for too long. First it was skipped because we were under the threshold, and then it was blocked by the result cleanup, which took very long because there was a lot to clean up after the result cleanup had itself been postponed.

I've been asking myself the following questions on how to improve this in the future:

Maybe we could change the locking to allow running the cleanup of assets and results concurrently? In our setup results and assets are on different disks so running both at the same time shouldn't be counterproductive and in this case it would have helped. In fact I resolved the mentioned alert by manually deleting the limit_tasks lock to let the asset cleanup run in parallel with the result cleanup.
The last 3 asset cleanup jobs which could actually have run did not, because at that point the threshold had not been reached and the cleanup was therefore skipped. The same applies to the result cleanup which ran before the currently active one. It was skipped because we were under the threshold, but that likely contributes to the fact that the first cleanup which actually runs again takes very long. Maybe we should rethink postponing the cleanup according to the thresholds. At the very least, the current thresholds cause the cleanup to be triggered too late.
Having thresholds for skipping the cleanup would make more sense if they were also considered for aborting the cleanup early. I suppose that would have helped in this case, e.g. the result cleanup would have aborted earlier so the asset cleanup could have run.
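Two of the ideas above, separate locks for asset and result cleanup and aborting a cleanup early once the threshold is met, can be sketched together as follows. This is a rough Python sketch under stated assumptions, not how openQA works today: the in-process `threading.Lock` objects merely stand in for the `limit_tasks` lock, and `CLEANUP_LOCKS`, `run_cleanup`, `free_percentage` and `min_free_percentage` are hypothetical names.

```python
import threading
from typing import Callable, Iterable, Optional

# Assumption: one lock per cleanup kind instead of a single shared lock,
# so asset and result cleanup no longer block each other.
CLEANUP_LOCKS = {"assets": threading.Lock(), "results": threading.Lock()}


def run_cleanup(kind: str,
                candidates: Iterable[str],
                delete: Callable[[str], None],
                free_percentage: Callable[[], float],
                min_free_percentage: float) -> Optional[int]:
    """Delete candidates until enough disk space is free.

    Returns the number of deleted items, or None if a cleanup of the
    same kind is already running.
    """
    lock = CLEANUP_LOCKS[kind]
    if not lock.acquire(blocking=False):
        return None  # only a cleanup of the *same* kind blocks this one
    try:
        deleted = 0
        for item in candidates:
            # Abort early: the same threshold that would skip the cleanup
            # also ends it once enough space has been freed.
            if free_percentage() >= min_free_percentage:
                break
            delete(item)
            deleted += 1
        return deleted
    finally:
        lock.release()
```

A real cleanup would still have to honour retention rules when choosing `candidates`; the sketch only shows how the locking and the early abort could interact.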

Subtasks 2 (0 open — 2 closed)

Related issues 1 (1 open — 0 closed)

Updated by tinita over 3 years ago

Target version set to future

Updated by mkittler over 3 years ago

Related to coordination #64881: [epic] Reconsider triggering cleanup jobs added

Updated by okurz over 3 years ago

Tracker changed from action to coordination
Subject changed from Improve/reconsider thresholds for skipping cleanup to [epic] Improve/reconsider thresholds for skipping cleanup
Target version changed from future to Ready
Parent task set to #80546