coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
coordination #76984: [epic] Automatically remove assets+results based on available free space
Trigger cleanup of results (or assets) if not enough free space based on configuration limit
See parent epic #76984. To be able to progress with #76984 we should try to split out smaller, simpler stories and start by implementing "df" calls in general. This would also allow us to gather experience on whether calling df is cheap and reliable enough.
- AC1: Regular cleanup of results (or assets) is triggered if free space for results (or assets) is below configured limit
- AC2: If no free space limit is configured no df check is called and no cleanup is triggered
- Extend the existing asset+result cleanup to
- check the free space of the filesystem including the assets/results directory
- compare the free space against a configured value, e.g. in openqa.ini
- trigger the same cleanup that we would trigger from the systemd timer
- can use https://software.opensuse.org/package/perl-Filesys-Df?search_term=perl-FileSys-Df
- can mock "df" in tests to simply give back what we want, e.g. "enough free space available" or "free space exceeded"
- Optional: Extend to assets as well
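The suggestions above can be illustrated with a minimal sketch. It is written in Python purely for illustration (openQA itself is Perl and could use Filesys::Df instead); the function names `free_percentage` and `maybe_trigger_cleanup`, the parameter `min_free_percentage` and the `cleanup` callback are hypothetical and not taken from the openQA code base. The "df" call is injectable so tests can mock it to report whatever is needed.

```python
import shutil
from typing import Callable, Optional


def free_percentage(path: str, disk_usage: Callable = shutil.disk_usage) -> float:
    """Return the free space of the filesystem containing `path`, in percent."""
    usage = disk_usage(path)  # named tuple with total, used, free (bytes)
    return usage.free / usage.total * 100


def maybe_trigger_cleanup(
    path: str,
    min_free_percentage: Optional[float],
    cleanup: Callable[[], None],
    disk_usage: Callable = shutil.disk_usage,
) -> bool:
    """Run `cleanup` only if free space is below the configured limit.

    Returns True if the cleanup was triggered.
    """
    if min_free_percentage is None:
        # AC2: no limit configured, so no df check and no cleanup from here
        return False
    if free_percentage(path, disk_usage) < min_free_percentage:
        # AC1: free space is below the limit, trigger the regular cleanup
        cleanup()
        return True
    return False
```

In a test, `disk_usage` can be replaced by a function returning a fixed named tuple, e.g. one reporting 10 % free to simulate "free space exceeded" and one reporting 50 % free to simulate "enough free space available".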
- Assignee set to mkittler
This feature is actually orthogonal to the related epic. The only common part is the use of df/statfs. As discussed for the other ticket, we should make that part configurable at some point.
I suppose I would create a separate script for this which checks df/statfs (sharing existing code introduced by https://github.com/os-autoinst/openQA/pull/3635). This script would enqueue the cleanup tasks depending on the outcome of the check. The script itself would be invoked by a systemd timer (with a higher frequency than the current timer, which enqueues the cleanup unconditionally).
Your wording "split out" makes it sound like you have taken a big ticket and split the task up. However, this ticket seems to be a different task. The only thing it has in common with the task from the epic is that df/statfs() is involved. I have not been talking about any use cases here so far but simply about what needs to be done to fulfill the ACs of the tickets.
However, I also don't see the use case for this ticket. Just triggering the current cleanup logic when running out of disk space will not necessarily prevent us from running out of disk space, so it isn't really helpful in that regard. This feature could lower the frequency at which the cleanup is triggered. However, there's no practical benefit. If the cleanup is triggered more often than necessary, that shouldn't cause any harm.
Actually I think that this story could help to cover #64881 . When we trigger cleanup whenever there is need to cleanup then we do not need to trigger the complete cleanup by timer but instead just call the cheaper "df", right? And the task to call "df" could be done by a periodic minion job instead of a systemd timer so that this functionality needs no further system administration by the user and can work in systemd-less variants as well. The second benefit would be by checking more often that the cleanup hysteresis curve would be smaller and hence less problems with "flickering alerts" in grafana, right?
by a periodic minion job instead of a systemd timer
We usually use systemd timers to enqueue Minion jobs so they run periodically. I don't know whether it is possible within the Minion framework itself to enqueue jobs periodically. (We cannot do that from the main web UI service because it uses preforking. Technically we could enqueue Minion jobs periodically from any of the other services which do not use preforking.)
When we trigger cleanup whenever there is need to cleanup then we do not need to trigger the complete cleanup by timer but instead just call the cheaper "df", right?
The second benefit would be by checking more often that the cleanup hysteresis curve would be smaller and hence less problems with "flickering alerts" in grafana, right?
Yes, but again that's not really helping with #76984. That ticket is about extending the cleanup algorithm (to solve running out of disk space due to over-allocation). This ticket is only about triggering the cleanup more often (also to solve running out of disk space, but where the cause is running the cleanup too infrequently) and, in addition, about saving resources by avoiding unnecessary cleanups. So these tickets address different problems.
Discussed with mkittler: Next steps
- Add an "early-abort" to the cleanup jobs based on df output, with a configurable minimum amount of space to keep free. With this we reuse the existing systemd timer-based triggering for the space-aware check
- When we have the early-abort we can configure the systemd timer to trigger more often (instance based) to reduce the hysteresis size
- Trigger cleanup also when creating new jobs, see #64881
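To illustrate why a more frequent space-aware check reduces the hysteresis size, here is a toy simulation in Python. It is only a sketch under strong assumptions: the disk fills at a constant rate, a cleanup always restores free space to a fixed level, and all numbers and names (`simulate_min_free` and its parameters) are made up for illustration rather than taken from openQA.

```python
def simulate_min_free(total, start_free, fill_per_step, limit_percent,
                      check_every, steps, free_after_cleanup):
    """Return the lowest free space observed during the simulation.

    Each step the disk fills by `fill_per_step`. Every `check_every` steps
    the space-aware check runs; the cleanup is skipped (early abort) unless
    the free percentage is below `limit_percent`.
    """
    free = start_free
    min_free = free
    for step in range(1, steps + 1):
        free = max(free - fill_per_step, 0)
        min_free = min(min_free, free)
        check_due = step % check_every == 0
        # integer comparison of free/total*100 < limit_percent
        if check_due and free * 100 < limit_percent * total:
            free = free_after_cleanup  # assumed effect of a cleanup run
    return min_free


# Same workload, limit of 20 % free, only the check interval differs:
print(simulate_min_free(1000, 250, 10, 20, 1, 48, 300))   # 190
print(simulate_min_free(1000, 250, 10, 20, 12, 48, 300))  # 130
```

In this model the free space swings between 190 and 300 units when checking every step, but between 130 and 300 when checking only every 12th step, so the less frequent check dips much further below the configured limit before a cleanup happens.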
- Status changed from In Progress to Feedback
- PR has been merged
- SR for OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/456
- Configured the same on o3 manually
- Status changed from Feedback to Resolved
It seems to work in production:
"result" => "Skipping, free disk space on '/var/lib/openqa/share/factory' exceeds configured percentage 20 % (free percentage: 22.9447968855517 %)",
(result cleanup not skipped)
martchus@openqa:~> df -h /assets /results
Dateisystem    Größe Benutzt Verf. Verw% Eingehängt auf
/dev/vdc        7,0T    5,5T  1,6T   79% /assets
/dev/vdd        5,5T    4,5T  1,1T   81% /results
"result" => "Skipping, free disk space on '/var/lib/openqa/share/factory' exceeds configured percentage 20 % (free percentage: 29.5415074793063 %)",
"result" => "Skipping, free disk space on '/var/lib/openqa/testresults' exceeds configured percentage 20 % (free percentage: 45.2762918429145 %)",
martchus@ariel:~> df -h /assets /var/lib/openqa/testresults
Dateisystem            Größe Benutzt Verf. Verw% Eingehängt auf
/dev/mapper/vg0-assets  3,0T    2,2T  894G   71% /assets
/dev/vdb1               5,0T    2,8T  2,3T   56% /var/lib/openqa