action #154177

closed

File systems alert Salt: One of the file systems is too full size:M

Added by livdywan 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2024-01-24
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

From Grafana FIRING:1:

 F0=90.11097415563623

From OSD:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
…
/dev/vdc         10T  9.0T  1.1T  90% /assets

Suggestions

Rollback steps


Related issues 2 (0 open, 2 closed)

Related to Containers - action #138512: [BCI] Reestablish the actual coverage of s390x and ppc64le (Resolved, mgrossu, 2023-10-25, 2024-01-31)
Copied to openQA Infrastructure - action #165096: [osd] Extend /assets + /space-slow to allow to store more assets size:S (Resolved, livdywan, 2024-01-24)

Actions #1

Updated by livdywan 8 months ago

  • Description updated (diff)
  • Parent task deleted (#151582)
Actions #2

Updated by okurz 8 months ago

  • Priority changed from High to Urgent
Actions #3

Updated by okurz 8 months ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to mkittler
Actions #4

Updated by okurz 8 months ago

  • Subject changed from File systems alert Salt: One of the file systems is too full to File systems alert Salt: One of the file systems is too full size:M
Actions #5

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #6

Updated by mkittler 8 months ago

  • Description updated (diff)
  • Status changed from Workable to Feedback

/assets are about 1 TiB so in order to be at 80 % again we need to free 100 GiB.

The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.

Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100

I reduced quotas:

This will free 90 GiB which might be good enough together with what the untracked assets will gain us. If it is still not enough I'll reduce the quotas of all groups but only slightly.

Actions #7

Updated by okurz 8 months ago

  • Status changed from Feedback to In Progress

Our rules prevent an urgent ticket in "Feedback". As you are actively discussing this ticket I think it's actually still "In Progress"

Actions #8

Updated by okurz 8 months ago

  • Due date set to 2024-02-07
Actions #10

Updated by okurz 8 months ago

mkittler wrote in #note-6:

/assets are about 1 TiB so in order to be at 80 % again we need to free 100 GiB.

What do you mean by that? /assets is 10TiB but 1.1TiB are free so to reach 80% we need to free 1TiB.

The cleanup is working. We've been above 80 % for at least 30 days so the space-awareness did not prevent cleanups from happening.

Untracked assets are a big group. Many of them are fixed but I suppose it would make sense to change untracked_assets_storage_duration from 7 days to e.g. 4 days: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100

It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in according job groups?

Actions #11

Updated by mkittler 8 months ago · Edited

What do you mean by that? /assets is 10TiB but 1.1TiB are free so to reach 80% we need to free 1TiB.

Oh, yes - it is 10 TiB and not just 1 TiB.
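
For reference, the corrected arithmetic can be sketched as below. The helper name `space_to_free` is made up for illustration, and the inputs are the rounded figures from the `df -h` output in the description, so the result is only approximate.

```python
def space_to_free(size_tib: float, used_tib: float, target_ratio: float) -> float:
    """Return how many TiB must be freed so that used/size <= target_ratio."""
    target_used = size_tib * target_ratio
    # Nothing needs to be freed if usage is already at or below the target.
    return max(0.0, used_tib - target_used)


# Rounded values from `df -h`: 10T total, 9.0T used; target is 80 % usage.
to_free = space_to_free(10.0, 9.0, 0.80)
print(f"{to_free:.1f} TiB need to be freed")
```

With these rounded inputs the result is roughly 1 TiB, which matches the figure given in #note-10.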


The quotas on the groups have been applied (not the retention of untracked assets). Unfortunately it didn't have any big effect. Likely assets are now just accounted to some of the smaller groups which I haven't changed. So I guess I'll have to change those as well, but first I'll wait until the change for untracked assets has been applied (so far we're at 3458.99 GiB for those).

Actions #12

Updated by mkittler 8 months ago

It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in according job groups?

Yes - I was also wondering about that. It could mean that. Or the corresponding jobs have already been removed so the asset ended up as untracked (or do we handle that in a better way?).

Actions #13

Updated by mkittler 8 months ago · Edited

Again no effect. Log messages like

martchus@openqa:~> sudo tail -f /var/log/openqa_gru
[2024-01-24T15:18:00.040231+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLES-POOL-x86_64-C-CURRENT-Media1 is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.041281+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-HA-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.042544+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-RT-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.044236+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLED-POOL-x86_64-C-CURRENT-Media1.license is not in any job group and will be deleted in 13 days
[2024-01-24T15:18:00.045967+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-V-CURRENT-Media1 is not in any job group and will be deleted in 6 days
[2024-01-24T15:18:00.047421+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-D-CURRENT-Media1 is not in any job group and will be deleted in 6 days

make it look like not even the 7 days we had configured before were effective.

That's strange considering sudo -u geekotest /usr/share/openqa/script/openqa eval -V 'app->config->{misc_limits}->{untracked_assets_storage_duration}'¹ prints 4 as expected and I did restart openqa-gru.

EDIT: Ok, this is actually a feature and an intentional configuration introduced by 981da57e04d1f30dc74844f98aff3598d2238d95:

[assets/storage_duration]
CURRENT = 30

The commit message just says "Add a limit of 30 days for CURRENT repos to the config".

I suppose we should also reduce those: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101
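
To illustrate why the 4-day default did not apply to these assets, here is a minimal sketch of a pattern-based override. It is not openQA's code: the function name `storage_duration_days`, the "first matching pattern wins" rule, and the second asset name are assumptions made for illustration. Only the `CURRENT = 30` entry and the default of 4 days come from this ticket.

```python
import re


def storage_duration_days(asset_name: str, default_days: int,
                          pattern_days: dict[str, int]) -> int:
    """Pick the retention for a groupless asset.

    Assumption: the first pattern that matches the asset name wins;
    otherwise the global default for untracked assets is used.
    """
    for pattern, days in pattern_days.items():
        if re.search(pattern, asset_name):
            return days
    return default_days


# Mirrors the [assets/storage_duration] section quoted above.
patterns = {"CURRENT": 30}

# Asset name taken from the gru log above.
print(storage_duration_days(
    "repo/SLE-15-SP6-Product-SLES-POOL-x86_64-C-CURRENT-Media1", 4, patterns))
# Hypothetical asset name that matches no pattern.
print(storage_duration_days("iso/some-other-asset.iso", 4, patterns))
```

Under these assumptions the first call yields the pattern-specific value and the second falls back to the default, which is consistent with CURRENT repos outliving the configured untracked_assets_storage_duration.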


¹ keys are C&P from our code
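
To see at a glance which remaining retention the gru log reports, lines like the ones quoted above can be grouped by the number of days. This is only a sketch: the regular expression is derived from the log excerpt in this comment, and the helper name `days_histogram` is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the log excerpt in this ticket.
LINE_RE = re.compile(
    r"Asset (?P<asset>\S+) is not in any job group "
    r"and will be deleted in (?P<days>\d+) days?"
)


def days_histogram(lines):
    """Count log lines per 'will be deleted in N days' value."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group("days"))] += 1
    return counts


# Two of the lines quoted above, used as sample input.
sample = [
    "[2024-01-24T15:18:00.040231+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Product-SLES-POOL-x86_64-C-CURRENT-Media1 is not in any job group and will be deleted in 13 days",
    "[2024-01-24T15:18:00.045967+01:00] [info] [pid:8461] Asset repo/SLE-15-SP6-Module-Python3-POOL-x86_64-V-CURRENT-Media1 is not in any job group and will be deleted in 6 days",
]
for days, count in sorted(days_histogram(sample).items()):
    print(f"{days} days: {count} asset(s)")
```

Values above the configured default in such a summary hint at a pattern-specific override being in effect.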

Actions #14

Updated by okurz 8 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini . Is that something we should add in there?

Actions #15

Updated by okurz 8 months ago · Edited

okurz wrote in #note-14:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 is merged. I failed to find a proper explanation for what "CURRENT" means nor did I find assets/storage_duration in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini . Is that something we should add in there?

Never mind, found it in http://open.qa/docs/#_configuring_limit_for_groupless_assets

So with that how about reverting https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1100 ?

Actions #16

Updated by mkittler 8 months ago

After https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1101 has been merged we're down at 85 %.

Considering we're still not at ~ 80 % I'd say we wait with reverting.

Actions #17

Updated by mkittler 8 months ago

It seems a big contributor is "SLE-BCI" for both s390x and ppc64le. Could it be that those assets are synced but never used by jobs whereas x86_64 and aarch64 are properly tracked in according job groups?

This is a temporary situation, see #138512.

Actions #18

Updated by okurz 8 months ago

  • Related to action #138512: [BCI] Reestablish the actual coverage of s390x and ppc64le added
Actions #19

Updated by mkittler 8 months ago

I created an SD ticket to request more storage: https://sd.suse.com/servicedesk/customer/portal/1/SD-146279

Actions #20

Updated by okurz 8 months ago

  • Priority changed from Urgent to High
Actions #21

Updated by okurz 8 months ago

  • Description updated (diff)
  • Due date deleted (2024-02-07)
  • Status changed from In Progress to Resolved

To come further below our triggering thresholds I went over more job group settings and reduced quotas.

I then triggered the cleanup. Now we are at 78%.

All steps and rollback steps are done and marked as such in the ticket description. The SD ticket might take an indefinite time to be handled, so I am resolving here already as I don't want us to wait for that.

Actions #22

Updated by okurz about 1 month ago

  • Copied to action #165096: [osd] Extend /assets + /space-slow to allow to store more assets size:S added