action #70885

[osd][alert] flaky file system alert: /assets

Added by okurz 5 months ago. Updated 3 months ago.

Status: Workable
Priority: Low
Assignee: -
Target version:
Start date: 2020-09-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

received alert email 2020-09-02 14:27Z

[Alerting] File systems alert

One of the file systems is too full

Metric name: /assets: Used Percentage
Value: 94.207

30m later the status switched back to "OK" but I guess we can easily hit the limit again.

Problem

The alert is flaky as it went back to "ok" without explicit user action.

Suggestions

  • Make sure some assets are cleaned up, as we cannot keep that many and 4.7 TB for assets is too much.
  • Research whether a better hysteresis can be implemented in Grafana, e.g. the alert would trigger once 94% is reached but only recover when usage drops below 92% (see the sketch below).
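
To illustrate the suggested hysteresis, here is a minimal sketch of the trigger/recover logic in plain Python; the 94%/92% thresholds come from the suggestion above, everything else (including whether Grafana can express this directly) is an assumption:

```python
# Minimal sketch of alert hysteresis: fire at >= 94 % usage, but only
# return to "ok" once usage drops below 92 %. Plain Python, not Grafana
# configuration.

TRIGGER = 94.0   # percent used at which the alert fires
RECOVER = 92.0   # percent used below which the alert clears again


def next_state(current_state: str, used_percent: float) -> str:
    """Return "alerting" or "ok" given the previous state and current usage."""
    if current_state == "ok" and used_percent >= TRIGGER:
        return "alerting"
    if current_state == "alerting" and used_percent < RECOVER:
        return "ok"
    return current_state


# Example: 94.2 % fires the alert, a dip to 93 % keeps it firing,
# only 91.5 % clears it again.
state = "ok"
for sample in (93.0, 94.2, 93.0, 91.5):
    state = next_state(state, sample)
    print(sample, state)
```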

Further notes

I did not pause the alert as it is currently "ok" and we need to be careful that the available disk space is not completely depleted.

94% usage on a filesystem is already a lot. We must not increase the alert threshold further.


Related issues

Copied to openQA Infrastructure - action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only (Resolved, 2020-09-02)

History

#1 Updated by mkittler 5 months ago

Since you mentioned in the chat that the growing size of fixed assets might be a problem: They are indeed 1394 GiB which is about 28 % of the total asset size. I've mentioned it on #testing. I'll wait a little bit for a response. Otherwise I'd just reduce the quotas for assets covered by the cleanup for now.

#2 Updated by mkittler 5 months ago

I reduced the quotas so the usage should drop below 92 % on the next cleanup. That's likely required anyway because fixed assets are also counted towards "full" groups like SLE15, so deleting those fixed assets would likely just lead to other assets being kept longer rather than freeing further space. (I can't retrigger a new cleanup task because one is currently running.)

#3 Updated by okurz 4 months ago

  • Priority changed from High to Urgent

We are back at 94% and this needs action.

#4 Updated by mkittler 4 months ago

  • Assignee set to mkittler

I came up with the following proposal on #testing which would hopefully help with the general "flakiness":

We're again at a critical level of disk usage of the assets partition on OSD. I wonder whether it would make sense to stop "overallocating". With that I mean: if you sum up the limits of all groups you end up with a number which is higher than the total disk space we have. It only works because some groups are not actually utilizing their limit. The obvious disadvantage of this approach is that I need to figure out which limits to reduce over and over again.
Wouldn't it make more sense if I shrank all group limits to fit the current utilization (maybe rounding up to the next number divisible by 5)? That alone wouldn't free any disk space, but then I could more easily free disk space by slightly decreasing all groups a little bit. And I wouldn't have to annoy you with the problem anymore, because unless someone increases the limits again the automatic cleanup should ensure the partition isn't getting completely full.

I'm waiting a little bit for feedback, but unless a good objection comes up soon I'm going to go for it, as "this needs action".
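
A minimal sketch of the proposed shrinking, assuming per-group limits and current utilization are available as plain GiB numbers; the group names and figures below are made up for illustration, only the rounding rule comes from the proposal above:

```python
import math

# Hypothetical per-group figures in GiB.
utilization_gib = {"SLE15": 812, "SLE12-SP5": 430, "Tumbleweed": 267}
old_limits_gib = {"SLE15": 1200, "SLE12-SP5": 800, "Tumbleweed": 400}


def shrunk_limit(used_gib: float) -> int:
    """Round the current utilization up to the next multiple of 5 GiB."""
    return 5 * math.ceil(used_gib / 5)


# Never grow a limit, only shrink it down towards the current utilization.
new_limits_gib = {
    group: min(old_limits_gib[group], shrunk_limit(used))
    for group, used in utilization_gib.items()
}
print(new_limits_gib)  # {'SLE15': 815, 'SLE12-SP5': 430, 'Tumbleweed': 270}
```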

#5 Updated by mkittler 4 months ago

  • Status changed from Workable to In Progress

We're back at 92 %. This time I not only reduced the limit of the big groups but also reduced the smaller ones and untracked assets. Additionally, I've now shrunk all groups to fit their current utilization (with a small margin) so they will hopefully not grow unattended anymore. However, according to my database queries we're still over-allocating a lot. Not sure where my accounting is wrong.
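
As a rough cross-check for that over-allocation, a minimal sketch that compares the sum of all group limits with the size of the /assets partition; the limit values and the mount point are assumptions, in reality the limits would come from the openQA database or web UI:

```python
import shutil

# Hypothetical per-group asset limits in GiB.
group_limits_gib = [815, 430, 270, 500, 950]

total_limit_gib = sum(group_limits_gib)
total_bytes, _used, _free = shutil.disk_usage("/assets")  # assumes the mount point
partition_gib = total_bytes / 1024**3

if total_limit_gib > partition_gib:
    print(f"over-allocated: limits sum to {total_limit_gib} GiB, "
          f"but /assets only provides {partition_gib:.0f} GiB")
```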

#6 Updated by mkittler 4 months ago

  • Assignee deleted (mkittler)
  • Priority changed from Urgent to High

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Now we're even back at 90 %. That's good, but I don't know how long that will remain the case. Not sure how to avoid the mentioned over-allocation (maybe my accounting is incorrect).

#7 Updated by nicksinger 4 months ago

mkittler wrote:

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Thanks, I think this already helps a lot to reduce the severity.

mkittler wrote:

Now we're even back at 90 %. That's good but I don't know how long it will be the case. Not sure how to avoid the mentioned over allocation (maybe my accounting is incorrect).

IMHO the overallocation is a separate issue to look into. For me this ticket is mainly about our alert and how it is structured. I've seen some talk in the past about this topic (unfortunately I lost the source) where predictions were used for such cases to rephrase the alert from "File system is full soon" to "If you don't act the FS will be full in the next 24h", and I think this is really what we're interested in. However, I currently fail to understand the predictions provided by InfluxDB so I can't propose a better solution :(
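
For illustration only, such a prediction can be as simple as a linear extrapolation over recent usage samples; a minimal sketch in plain Python with invented data points, not the InfluxDB prediction functionality mentioned above:

```python
# Estimate "hours until full" by fitting a line through recent usage samples.
samples = [(0, 88.0), (6, 90.5), (12, 92.1), (18, 93.8)]  # (hours, used %)

n = len(samples)
mean_t = sum(t for t, _ in samples) / n
mean_u = sum(u for _, u in samples) / n
slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / sum(
    (t - mean_t) ** 2 for t, _ in samples
)  # growth in percent per hour

latest_t, latest_u = samples[-1]
if slope > 0:
    hours_until_full = (100.0 - latest_u) / slope
    print(f"projected to reach 100 % in {hours_until_full:.1f} h")
    # an alert could then fire whenever hours_until_full drops below 24
```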

#8 Updated by okurz 4 months ago

  • Copied to action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only added

#9 Updated by okurz 4 months ago

  • Status changed from In Progress to Workable

#10 Updated by mkittler 3 months ago

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

#11 Updated by mkittler 3 months ago

  • Priority changed from High to Normal

#12 Updated by okurz 3 months ago

mkittler wrote:

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

Yes, this ticket is not about /assets having enough space but about preventing the flakiness. I wonder, isn't that also linked to how often we trigger the gru cleanup jobs? So somehow we need to design the time values in Grafana with that knowledge in mind.
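
One way to encode that relation, purely as a sketch and with the interval as an assumption: only treat high usage as alert-worthy once it has persisted for longer than one cleanup interval, so a spike that the next cleanup would resolve never fires the alert.

```python
# The 1-hour cleanup interval is an assumption; the real value depends on
# how often the gru asset cleanup actually runs on OSD.
CLEANUP_INTERVAL_HOURS = 1.0


def should_alert(hours_above_threshold: float) -> bool:
    """Alert only if usage stayed above the threshold for longer than one
    cleanup interval, i.e. the cleanup had a chance to fix it and didn't."""
    return hours_above_threshold > CLEANUP_INTERVAL_HOURS


print(should_alert(0.5))  # False: the next cleanup may still resolve this
print(should_alert(2.0))  # True: cleanup already ran and usage is still high
```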

#13 Updated by okurz 3 months ago

  • Priority changed from Normal to Low
