action #154177
File systems alert Salt: One of the file systems is too full size:M
Status: closed
Description
Observation
From Grafana FIRING:1:
F0=90.11097415563623
From OSD:
# df -h
Filesystem      Size  Used Avail Use% Mounted on
…
/dev/vdc         10T  9.0T  1.1T  90% /assets
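(The alert value F0 is the fill level in percent as computed by the Grafana query, matching the 90 % use df reports for /assets.)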
Suggestions
- DONE Add a silence http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DFile+systems+alert&matcher=grafana_folder%3DSalt&matcher=rule_uid%3Dai0h5ifVk&orgId=1
- DONE View dashboard http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1
- DONE View panel http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1&viewPanel=74
- DONE Check which assets take the most space (see the du sketch after this list)
- DONE (it runs) Crosscheck that our asset cleanup is actually running
- DONE Our space-aware cleanup should keep a buffer free, so if we are now exceeding 90 % that likely means that the job group quotas are way too high in sum
- DONE Check settings per job group and adjust quotas as necessary
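For future reference, checking which assets take the most space can also be done from the shell; a minimal sketch, assuming the default openQA asset location /var/lib/openqa/share/factory (the openQA web UI additionally breaks asset usage down per job group):

# disk usage per asset type (iso, hdd, repo, …), largest first
du -sh /var/lib/openqa/share/factory/* | sort -rh
# the 20 biggest individual assets
du -ah /var/lib/openqa/share/factory | sort -rh | head -n 20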
Rollback steps
- DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences
Updated by mkittler 11 months ago
- Description updated (diff)
- Status changed from Workable to Feedback
/assets is about 1 TiB so in order to be at 80 % again we need to free 100 GiB.
The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.
Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100
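For reference, judging by the config key queried later in this ticket (app->config->{misc_limits}->{untracked_assets_storage_duration}), the merge request presumably boils down to a change in openqa.ini like:

[misc_limits]
untracked_assets_storage_duration = 4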
I reduced quotas:
- https://openqa.suse.de/admin/job_templates/520 - from 400 GiB to 360 GiB as it takes a lot of space for a development group
- https://openqa.suse.de/admin/job_templates/110 - from 460 GiB to 430 GiB as it then is still by far the biggest group
- https://openqa.suse.de/admin/job_templates/446 - from 240 GiB to 230 GiB as it is a big development group
- https://openqa.suse.de/admin/job_templates/125 - from 200 GiB to 190 GiB as it is a big development group
This will free 90 GiB in total (40 + 30 + 10 + 10 GiB from the four reductions above), which might be good enough together with what the untracked assets will gain us. If it is still not enough I'll reduce the quotas of all groups, but only slightly.
Updated by okurz 11 months ago
mkittler wrote in #note-6:
/assets is about 1 TiB so in order to be at 80 % again we need to free 100 GiB.
What do you mean by that? /assets is 10 TiB but 1.1 TiB are free, so to reach 80 % we need to free 1 TiB.
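(At 80 % of 10 TiB only 8 TiB would be used; 9 TiB are used now, hence 1 TiB needs to be freed.)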
The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.
Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100
It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in the corresponding job groups?
Updated by mkittler 11 months ago · Edited
What do you mean by that? /assets is 10 TiB but 1.1 TiB are free, so to reach 80 % we need to free 1 TiB.
Oh, yes - it is 10 TiB and not just 1 TiB.
The quotas on the groups have been applied (the reduced retention for untracked assets has not yet). Unfortunately it didn't have any big effect. Likely assets are now just accounted to some of the smaller groups which I haven't changed. So I guess I'll have to adjust those as well, but first I'll wait until the change for untracked assets has been applied (so far we're at 3458.99 GiB for those).
Updated by mkittler 11 months ago
It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in the corresponding job groups?
Yes - I was also wondering about that. It could mean that. Or the corresponding jobs have already been removed so the asset ended up as untracked (or do we handle that in a better way?).
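A rough way to crosscheck would be querying the database directly; a sketch assuming openQA's assets and jobs_assets tables and using "referenced by no job" as a proxy for untracked (table and column names taken from the openQA schema, unverified here):

# list the 20 biggest assets that no job references anymore
sudo -u geekotest psql openqa -c "
  SELECT type, name, pg_size_pretty(size) AS size
    FROM assets
   WHERE id NOT IN (SELECT asset_id FROM jobs_assets)
   ORDER BY size DESC NULLS LAST
   LIMIT 20;"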
Updated by mkittler 11 months ago · Edited
Again no effect. Log messages like
martchus@openqa:~> sudo tail -f /var/log/openqa_gru
[2024-01-24T15:18:00.040231+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLES-POOL-x86_64-C-CURRENT-Media1 is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.041281+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-HA-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.042544+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-RT-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.044236+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLED-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.045967+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-V-CURRENT-Media1 is not in any job group and will be deleted in 6 days
[2024-01-24T15:18:00.047421+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-D-CURRENT-Media1 is not in any job group and will be deleted in 6 days
make it look like not even the 7 days we had configured before were effective.
That's strange considering sudo -u geekotest /usr/share/openqa/script/openqa eval -V 'app->config->{misc_limits}->{untracked_assets_storage_duration}'¹ prints 4 as expected and I did restart openqa-gru.
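(eval here is the stock Mojolicious application command: it runs the given Perl snippet with the application loaded and -V prints the result, so this shows the effective config value the GRU service should see.)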
EDIT: Ok, this is actually a feature and an intentional configuration introduced by 981da57e04d1f30dc74844f98aff3598d2238d95:
[assets/storage_duration]
CURRENT = 30
The commit message just says "Add a limit of 30 days for CURRENT repos to the config".
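(Per the openQA documentation on limits for groupless assets, linked further down in this ticket, keys in the [assets/storage_duration] section appear to be matched against asset names, so any untracked asset containing "CURRENT" is kept for 30 days regardless of untracked_assets_storage_duration; that would also explain the 13- and 6-day countdowns in the log above.)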
I suppose we should also reduce those: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101
¹ the config keys are copy-pasted from our code
Updated by okurz 11 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means, nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini. Is that something we should add in there?
Updated by okurz 11 months ago · Edited
okurz wrote in #note-14:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means, nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini. Is that something we should add in there?
nevermind, found it in http://open.qa/docs/#_configuring_limit_for_groupless_assets
So with that how about reverting https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100 ?
Updated by mkittler 11 months ago
After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 has been merged we're down at 85 %.
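(On the 10 TiB filesystem that is roughly 0.5 TiB freed compared to the 90 % fill level at alert time.)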
Considering we're still not at ~ 80 % I'd say we hold off on reverting.
Updated by okurz 11 months ago
- Related to action #138512: [BCI] Reestablish the actual coverage of s390x and ppc64le added
Updated by mkittler 11 months ago
I created an SD ticket to request more storage: https://sd.suse.com/servicedesk/customer/portal/1/SD-146279
Updated by okurz 11 months ago
- Description updated (diff)
- Due date deleted (2024-02-07)
- Status changed from In Progress to Resolved
To get further below our alert threshold I went over more job group settings and reduced quotas (numbers are GiB):
- https://openqa.suse.de/admin/job_templates/520 Yam Support Images 360->280
- https://openqa.suse.de/admin/job_templates/139 SLE 12 SP5 Functional: Server 150->20
- https://openqa.suse.de/admin/job_templates/421 YaST Maintenance Updates 190->180
- https://openqa.suse.de/admin/job_templates/125 Staging: SLE 15 180->160
- https://openqa.suse.de/admin/job_templates/446 220->160
- https://openqa.suse.de/admin/job_templates/477 Test-LilyZhao 100->10
- https://openqa.suse.de/admin/job_templates/321 Maintenance - QR - SLE15SP2 120->5
- many more from 170 GB to 160 GB
And triggered cleanup. Now we are at 78%.
All steps and rollback steps are done and marked as such in the ticket description. The SD ticket might take an indefinite time to be done, so resolving here already as I don't want us to wait for that.
Updated by okurz 4 months ago
- Copied to action #165096: [osd] Extend /assets + /space-slow to allow to store more assets size:S added