action #55178
closed
- Tracker changed from tickets to action
- Project changed from openSUSE admin to openQA Infrastructure (public)
- Priority changed from Normal to Urgent
- Private changed from Yes to No
I have not yet acknowledged the monitoring alert as I fear it might otherwise be ignored for too long.
- Description updated (diff)
As of today we are at 97%. Trying to handle it. Running du is no fun though :( I reduced some job group sizes via https://openqa.suse.de/admin/assets and triggered another cleanup. Down to 96% already.
Now https://openqa.suse.de/admin/assets lists "Assets by group (total 3803.24 GiB)" out of 8807 GiB currently used, which leaves 5004 GiB for other data, e.g. test results, thumbnails and logs. IMHO it makes sense that test results take more space than the assets, e.g. for bug investigation. I realized that QAM takes up a bigger proportion compared to other teams. I queried the database and provided some data in #55637, asking QAM to help.
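For reference, a rough sketch of how such a per-group asset breakdown could be queried directly on the host; the table and column names (job_groups, jobs, jobs_assets, assets.size) follow the usual openQA schema, and the database user/name are assumptions here, not necessarily the query used for #55637:
sudo -u geekotest psql openqa -c "
  -- sum each asset once per job group that references it
  SELECT jg.name, pg_size_pretty(sum(a.size)::bigint) AS asset_size
    FROM job_groups jg
    JOIN (SELECT DISTINCT j.group_id, ja.asset_id
            FROM jobs j JOIN jobs_assets ja ON ja.job_id = j.id) x
      ON x.group_id = jg.id
    JOIN assets a ON a.id = x.asset_id
   GROUP BY jg.name
   ORDER BY sum(a.size) DESC;"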
- Status changed from New to Workable
- Assignee set to okurz
- Status changed from Workable to In Progress
I am going over the job groups on https://openqa.suse.de/admin/assets# one by one, looking for ones that store multiple builds of the same assets, i.e. redundancy I can reduce, e.g. "Development / Staging: SLE 15" with 120 GB. Reducing that to 100 GB, and another migration job group from 100 GB to 80 GB. Strangely it seems I can not edit the job group properties of e.g. https://openqa.suse.de/admin/job_templates/34 directly because it is based on a YAML schedule, but I should still be able to edit job group properties. Reported as #55646
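Besides the web UI, the asset size limit of a group can also be changed over the API. A minimal sketch with the openQA client, assuming the job_groups route accepts size_limit_gb and using group 34 from the link above purely as an example:
# lower the asset quota of job group 34 to 100 GB (parameter name assumed)
openqa-client --host https://openqa.suse.de job_groups/34 put size_limit_gb=100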
I went over the complete list of all job groups again and reduced the job group quota where feasible. At least this way I could reduce the usage to 91%, which is below the current monitoring warning threshold again.
- Status changed from In Progress to Feedback
Finished one run of "du":
openqa:/var/lib/openqa # sort -n du.log
0G ./tmp
1G ./.config
1G ./.ssh
1G ./db
1G ./webui
10G ./backup
1643G ./images
2945G ./testresults
3674G ./share
8272G .
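For reference, a log like the one above could be produced with something like the following; the exact du invocation is an assumption, only the sort -n call is shown above:
# per-directory totals in whole GiB, one level deep
cd /var/lib/openqa && du -BG --max-depth=1 . > du.log
sort -n du.log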
With this I guess we can actually live for the time being and have enough time until we have finished an upgrade and can use more/bigger volumes.
To help with the challenge of finding out which part of our data takes up how much space we split the storage into individual partitions (see the sketch below). They are a bit harder to manage though.
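With separate filesystems the per-area usage becomes directly visible without a long du run, e.g. (mount points assumed for illustration, not necessarily the actual layout):
# one line per filesystem, so the usage per area shows up immediately
df -h /var/lib/openqa /var/lib/openqa/share /var/lib/openqa/testresults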
- Status changed from Feedback to Resolved
It seems coolo (and mkittler?) were more severe, constraining job groups way more and deleting a lot more data. Also it seems we successfully tricked Engineering Infrastructure as they can not easily reduce the size of the partitions we already have. Some subtickets are still open but in general I think we are good.
Infra will come back to us once they have temporary room (you know Towers of Hanoi, no? :) - and the plan still is to reduce /results to 5TB. But we still have 1.5T headroom for that.