coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
Reconsider triggering cleanup jobs
Currently the cleanup jobs are triggered time-based using a systemd timer. So far this ticket is just a collection of ideas on how we can improve this.
The time-based trigger might not be frequent enough, especially if the cleanup is aborted in the middle for some reason. E.g. on o3 we see that results/screenshots can pile up a lot, which raises the error rate even further: as a consequence the cleanup jobs can take quite long (see https://progress.opensuse.org/issues/55922) and are possibly interrupted with the result "worker went away".
Triggering the cleanup blindly when new jobs are scheduled, as we did before, is not nice either. It means creating tons of Minion jobs which clutter the dashboard and just terminate because a cleanup job is already running.
- AC1: Idle instances of openQA, e.g. personal single-user developer instances, only trigger cleanup jobs when quota usage is likely to change, e.g. when new builds or jobs are scheduled or jobs complete
- AC2: Cleanup jobs are only triggered when a useful effect is expected, e.g. not 100 times in a row shortly after each other
One possibility to solve this would be that the jobs delete themselves if they can't acquire the lock. Another possibility would be acquiring the lock before creating the job: if that's not possible there will simply be no job, and if it is possible the job needs to take over the lock.
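The second possibility (acquire the lock first, create the job only on success) could look roughly like the following. This is a minimal Python sketch, not openQA or Minion code: `CleanupScheduler`, `trigger` and `run_next` are hypothetical names, and a plain in-process lock and list stand in for the Minion lock and job queue.

```python
import threading


class CleanupScheduler:
    """Hypothetical scheduler: enqueue a cleanup job only if the lock is free."""

    def __init__(self):
        self._lock = threading.Lock()
        self.queue = []  # stands in for the Minion job queue (assumption)

    def trigger(self, reason):
        # Try to take the lock without blocking. If a cleanup is already
        # pending or running, no job is created at all.
        if not self._lock.acquire(blocking=False):
            return False
        self.queue.append(reason)
        return True

    def run_next(self):
        # The job takes over the lock acquired in trigger() and releases it
        # once the cleanup is done, even if the cleanup raises.
        reason = self.queue.pop(0)
        try:
            return f"cleanup done ({reason})"
        finally:
            self._lock.release()
```

With this shape, a burst of triggers while a cleanup is pending leaves exactly one job in the queue, so nothing piles up on the dashboard.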
Note that triggering the cleanup more frequently will not magically solve problems without adjusting quotas for the result storage duration. Now that we keep track of the result size we could additionally add a size-based threshold for results (maybe specify a max. total size and a percentage for each job group).
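As an illustration of the size-based idea, the sketch below checks each job group against its share of a maximum total size. The function name, the parameter names and the data layout are assumptions made for this example; they do not correspond to existing openQA settings.

```python
def groups_over_quota(max_total_bytes, group_percent, group_usage_bytes):
    """Return {group: excess_bytes} for every group exceeding its share.

    max_total_bytes   -- hypothetical global limit for all results
    group_percent     -- hypothetical share per job group, in percent
    group_usage_bytes -- tracked result size per job group
    """
    over = {}
    for group, used in group_usage_bytes.items():
        # Groups without a configured percentage get no share at all.
        limit = max_total_bytes * group_percent.get(group, 0) // 100
        if used > limit:
            over[group] = used - limit
    return over
```

For example, with a limit of 1000 bytes, shares of 50 % and 30 %, and usages of 400 and 450 bytes, only the second group is reported, with an excess of 150 bytes.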
#4 Updated by okurz about 1 year ago
new mojo minion expiring jobs as in https://github.com/mojolicious/minion/compare/v10.11...v10.12#diff-c112bb3542e98308d12d5ecb10a67abcR2 might help
#5 Updated by okurz about 1 year ago
We discussed this topic in the QA tools weekly meeting 2020-07-28. "Expiring jobs" would expire based on just time and as we want to look into event-based triggers they are not that helpful. But as kraih explained we can go another way: We can create minion jobs that can create locks which expire after a time which can be longer than the time the actual minion job runs. So what we should do:
- Add back trigger for cleanup job
- Use a configurable "dead-time" for locks
- Optional, after that: Based on configuration, call "df", compare free space with a configurable limit and only trigger cleanup job if threshold is exceeded
  - can use https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df
  - can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"
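The steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the openQA implementation: `ExpiringLock`, `free_percentage` and `maybe_trigger_cleanup` are hypothetical names, `shutil.disk_usage` stands in for the Perl `Filesys::Df` module, and the `disk_usage` and `clock` parameters exist only so that tests can substitute mocks for "df" and for time.

```python
import shutil
import time


class ExpiringLock:
    """Lock with a configurable dead-time; it is never released explicitly."""

    def __init__(self, dead_time, clock=time.monotonic):
        self.dead_time = dead_time  # seconds, may exceed the job's runtime
        self._clock = clock
        self._expires_at = None

    def acquire(self):
        now = self._clock()
        if self._expires_at is not None and now < self._expires_at:
            return False  # still within the dead-time of an earlier trigger
        self._expires_at = now + self.dead_time
        return True


def free_percentage(path, disk_usage=shutil.disk_usage):
    usage = disk_usage(path)
    return 100 * usage.free / usage.total


def maybe_trigger_cleanup(lock, path, min_free_percentage, enqueue,
                          disk_usage=shutil.disk_usage):
    """Enqueue a cleanup only if space is low and the dead-time has passed."""
    # Check free space first so that the lock is not used up while there is
    # still enough space available.
    if free_percentage(path, disk_usage) >= min_free_percentage:
        return False
    if not lock.acquire():
        return False
    enqueue()
    return True
```

In a test, passing a `disk_usage` callable that returns an object with fixed `total` and `free` attributes reproduces the "enough free space available" and "free space exceeded" cases without touching the real file system.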
- Status changed from Feedback to Blocked
By default the cleanup systemd timers are started as a dependency in systemd/openqa-webui.service. Currently the space-aware cleanup is also triggered when the timers fire. We can consider triggering the space-aware cleanup when "jobs post" or "isos post" is called and no longer pulling in the systemd timers by default.
waiting for #91782 which we see as related
We also have the feature of keeping a minimum amount of space but have not enabled it in production yet
- Due date set to 2021-08-31
- Status changed from New to Feedback
- Assignee set to cdywan
Some thoughts from the planning poker:
- We may want backups in place for tackling this (#94555)
- Maybe we want event-based triggers instead of (systemd) timers?
- We could have a workshop about this topic? From the user story, personal setups, experience with production setups e.g. assets being cleaned up too soon - see workshop topics
I'll take the ticket and look into a workshop slot with mkittler as a resident expert