action #154177
File systems alert Salt: One of the file systems is too full size:M
Status: closed
Description
Observation
From Grafana FIRING:1:
F0=90.11097415563623
From OSD:
# df -h
Filesystem      Size  Used Avail Use% Mounted on
…
/dev/vdc         10T  9.0T  1.1T  90% /assets
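(The alert value F0 is the fill level in percent as computed by the Grafana query, matching the 90 % use df reports for /assets.)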
Suggestions
- DONE Add a silence http://stats.openqa-monitor.qa.suse.de/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DFile+systems+alert&matcher=grafana_folder%3DSalt&matcher=rule_uid%3Dai0h5ifVk&orgId=1
- DONE View dashboard http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1
- DONE View panel http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1&viewPanel=74
- DONE Check which assets take the most space (see the du sketch after this list)
- DONE (it runs) Crosscheck that our asset cleanup is actually running
- DONE Our space-aware cleanup should keep a buffer free, so if we are now exceeding 90 % that likely means that the job group quotas are way too high in sum
- DONE Check settings per job group and adjust quotas as necessary
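For future reference, checking which assets take the most space can also be done from the shell; a minimal sketch, assuming the default openQA asset location /var/lib/openqa/share/factory (the openQA web UI additionally breaks asset usage down per job group):

# disk usage per asset type (iso, hdd, repo, …), largest first
du -sh /var/lib/openqa/share/factory/* | sort -rh
# the 20 biggest individual assets
du -ah /var/lib/openqa/share/factory | sort -rh | head -n 20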
Rollback steps
- DONE Remove silence from https://stats.openqa-monitor.qa.suse.de/alerting/silences
Updated by mkittler 11 months ago
- Description updated (diff)
- Status changed from Workable to Feedback
/assets is about 1 TiB so in order to be at 80 % again we need to free 100 GiB.
The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.
Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100
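For reference, judging by the config key queried later in this ticket (app->config->{misc_limits}->{untracked_assets_storage_duration}), the merge request presumably boils down to a change in openqa.ini like:

[misc_limits]
untracked_assets_storage_duration = 4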
I reduced quotas:
- https://openqa.suse.de/admin/job_templates/520 - from 400 GiB to 360 GiB as it takes a lot of space for a development group
- https://openqa.suse.de/admin/job_templates/110 - from 460 GiB to 430 GiB as it then is still by far the biggest group
- https://openqa.suse.de/admin/job_templates/446 - from 240 GiB to 230 GiB as it is a big development group
- https://openqa.suse.de/admin/job_templates/125 - from 200 GiB to 190 GiB as it is a big development group
This will free 90 GiB in total (40 + 30 + 10 + 10 GiB from the four reductions above), which might be good enough together with what the untracked assets will gain us. If it is still not enough I'll reduce the quotas of all groups, but only slightly.
Updated by okurz 11 months ago
mkittler wrote in #note-6:
/assets is about 1 TiB so in order to be at 80 % again we need to free 100 GiB.
What do you mean by that? /assets is 10 TiB but 1.1 TiB are free, so to reach 80 % we need to free 1 TiB.
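(At 80 % of 10 TiB only 8 TiB would be used; 9 TiB are used now, hence 1 TiB needs to be freed.)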
The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.
Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100
It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in the corresponding job groups?
Updated by mkittler 11 months ago · Edited
What do you mean by that? /assets is 10 TiB but 1.1 TiB are free, so to reach 80 % we need to free 1 TiB.
Oh, yes - it is 10 TiB and not just 1 TiB.
The quotas on the groups have been applied (the reduced retention for untracked assets has not yet). Unfortunately it didn't have any big effect. Likely assets are now just accounted to some of the smaller groups which I haven't changed. So I guess I'll have to adjust those as well, but first I'll wait until the change for untracked assets has been applied (so far we're at 3458.99 GiB for those).
Updated by mkittler 11 months ago
It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in the corresponding job groups?
Yes - I was also wondering about that. It could mean that. Or the corresponding jobs have already been removed so the asset ended up as untracked (or do we handle that in a better way?).
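A rough way to crosscheck would be querying the database directly; a sketch assuming openQA's assets and jobs_assets tables and using "referenced by no job" as a proxy for untracked (table and column names taken from the openQA schema, unverified here):

# list the 20 biggest assets that no job references anymore
sudo -u geekotest psql openqa -c "
  SELECT type, name, pg_size_pretty(size) AS size
    FROM assets
   WHERE id NOT IN (SELECT asset_id FROM jobs_assets)
   ORDER BY size DESC NULLS LAST
   LIMIT 20;"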
Updated by mkittler 11 months ago · Edited
Again no effect. Log messages like
martchus@openqa:~> sudo tail -f /var/log/openqa_gru
[2024-01-24T15:18:00.040231+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLES-POOL-x86_64-C-CURRENT-Media1 is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.041281+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-HA-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.042544+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-RT-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.044236+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLED-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.045967+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-V-CURRENT-Media1 is not in any job group and will be deleted in 6 days
[2024-01-24T15:18:00.047421+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-D-CURRENT-Media1 is not in any job group and will be deleted in 6 days
make it look like not even the 7 days we had configured before were effective.
That's strange considering sudo -u geekotest /usr/share/openqa/script/openqa eval -V 'app->config->{misc_limits}->{untracked_assets_storage_duration}'¹ prints 4 as expected and I did restart openqa-gru.
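(eval here is the stock Mojolicious application command: it runs the given Perl snippet with the application loaded and -V prints the result, so this shows the effective config value the GRU service should see.)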
EDIT: Ok, this is actually a feature and an intentional configuration introduced by 981da57e04d1f30dc74844f98aff3598d2238d95:
[assets/storage_duration]
CURRENT = 30
The commit message just says "Add a limit of 30 days for CURRENT repos to the config".
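(Per the openQA documentation on limits for groupless assets, linked further down in this ticket, keys in the [assets/storage_duration] section appear to be matched against asset names, so any untracked asset containing "CURRENT" is kept for 30 days regardless of untracked_assets_storage_duration; that would also explain the 13- and 6-day countdowns in the log above.)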
I suppose we should also reduce those: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101
¹ the config keys are copy-pasted from our code
Updated by okurz 11 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means, nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini. Is that something we should add in there?
Updated by okurz 11 months ago · Edited
okurz wrote in #note-14:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means, nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini. Is that something we should add in there?
nevermind, found it in http://open.qa/docs/#_configuring_limit_for_groupless_assets
So with that how about reverting https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100 ?
Updated by mkittler 11 months ago
After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 has been merged we're down at 85 %.
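(On the 10 TiB filesystem that is roughly 0.5 TiB freed compared to the 90 % fill level at alert time.)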
Considering we're still not at ~ 80 % I'd say we hold off on reverting.
Updated by okurz 11 months ago
- Related to action #138512: [BCI] Reestablish the actual coverage of s390x and ppc64le added
Updated by mkittler 11 months ago
I created an SD ticket to request more storage: https://sd.suse.com/servicedesk/customer/portal/1/SD-146279
Updated by okurz 11 months ago
- Description updated (diff)
- Due date deleted (2024-02-07)
- Status changed from In Progress to Resolved
To get further below our alert threshold I went over more job group settings and reduced quotas (numbers are GiB):
- https://openqa.suse.de/admin/job_templates/520 Yam Support Images 360->280
- https://openqa.suse.de/admin/job_templates/139 SLE 12 SP5 Functional: Server 150->20
- https://openqa.suse.de/admin/job_templates/421 YaST Maintenance Updates 190->180
- https://openqa.suse.de/admin/job_templates/125 Staging: SLE 15 180->160
- https://openqa.suse.de/admin/job_templates/446 220->160
- https://openqa.suse.de/admin/job_templates/477 Test-LilyZhao 100->10
- https://openqa.suse.de/admin/job_templates/321 Maintenance - QR - SLE15SP2 120->5
- many more from 170 GB to 160 GB
And triggered cleanup. Now we are at 78%.
All steps and rollback steps are done and marked as such in the ticket description. The SD ticket might take an indefinite time to be done, so resolving here already as I don't want us to wait for that.
Updated by okurz 4 months ago
- Copied to action #165096: [osd] Extend /assets + /space-slow to allow to store more assets size:S added