action #70885

closed

[osd][alert] flaky file system alert: /assets

Added by okurz over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2020-09-02
Due date:
2021-07-23
% Done:

0%

Estimated time:
Tags:

Description

Observation

received alert email 2020-09-02 14:27Z

[Alerting] File systems alert

One of the file systems is too full

Metric name: /assets: Used Percentage
Value: 94.207

30m later the status switched back to "OK" but I guess we can easily hit the limit again.

The panel can be found at
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1

Problem

The alert is flaky as it went back to "ok" without explicit user action.

Suggestions

  • Make sure some assets are cleaned up; we cannot keep that many, and 4.7 TB for assets is too much.
  • Research whether a better hysteresis can be implemented in grafana, e.g. the alert would trigger once 94% is reached but only recover once usage drops below 92% (see the sketch after this list).
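
The second suggestion describes a classic hysteresis band. Below is a minimal sketch of that behaviour, assuming an external check script would evaluate it (grafana itself did not offer hysteresis at the time); the thresholds are the ones named above, the function name is hypothetical.

```python
TRIGGER_PERCENT = 94.0  # alert fires at or above this usage
RECOVER_PERCENT = 92.0  # alert only clears again below this usage


def evaluate(usage_percent: float, alerting: bool) -> bool:
    """Return the new alert state given the current usage and the previous state."""
    if not alerting and usage_percent >= TRIGGER_PERCENT:
        return True   # crossed the upper threshold: fire
    if alerting and usage_percent < RECOVER_PERCENT:
        return False  # recover only once usage is clearly below the trigger
    return alerting   # inside the band: keep the previous state


# 93 % keeps an already-firing alert active instead of flapping back to OK
print(evaluate(94.3, alerting=False))  # True
print(evaluate(93.0, alerting=True))   # True
print(evaluate(91.8, alerting=True))   # False
```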

Further notes

I did not pause the alert as it is currently "ok" and we need to be careful that the available disk space is not completely depleted.

94% usage on a filesystem is already a lot. We must not increase the alert threshold any further.


Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only (Resolved, mkittler, 2020-09-02)

Actions #1

Updated by mkittler over 3 years ago

Since you mentioned in the chat that the growing size of fixed assets might be a problem: they are indeed 1394 GiB, which is about 28 % of the total asset size. I've mentioned it on #testing and will wait a little bit for a response. Otherwise I'd just reduce the quotas for assets covered by the cleanup for now.

Actions #2

Updated by mkittler over 3 years ago

I reduced the quotas so the usage should drop below 92 % on the next cleanup. That's likely required anyway because fixed assets are also accounted to "full" groups like SLE15, so deleting those fixed assets would likely just lead to other assets being kept longer rather than freeing further space. (I can't retrigger a new cleanup task because one is currently running.)

Actions #3

Updated by okurz over 3 years ago

  • Priority changed from High to Urgent

We are back at 94% and this needs action.

Actions #4

Updated by mkittler over 3 years ago

  • Assignee set to mkittler

I came up with the following proposal on #testing which would hopefully help with the general "flakiness":

We're again at a critical level of disk usage on the assets partition on OSD. I wonder whether it would make sense to stop "overallocating". With that I mean: if you sum up the limits for all groups you'll end up with a number which is higher than the total disk space we have. It only works because some groups are not actually utilizing their limit. The obvious disadvantage of this approach is that I need to figure out which limits to reduce over and over again.
Wouldn't it make more sense if I shrank all group limits to fit the current utilization (maybe rounding up to the next number divisible by 5)? That alone wouldn't free any disk space, but then I could free disk space more easily by slightly decreasing all groups a little bit. And I wouldn't have to annoy you with the problem anymore, because unless someone increases the limits again the automatic cleanup should ensure the partition isn't getting completely full.

I'm waiting a little bit for feedback, but unless a good objection comes up soon I'm going to go for it as "this needs action".
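
A rough sketch of the shrinking proposal, assuming the per-group utilization in GiB is already known; the group names and numbers below are made up, only the "round up to the next multiple of 5" rule comes from the proposal above.

```python
import math

# Hypothetical per-group utilization in GiB (numbers made up). The proposal:
# set each group's limit to its current utilization, rounded up to the next
# multiple of 5, so the sum of the limits no longer exceeds the partition size.
utilization_gib = {"SLE15": 812, "SLE12-SP5": 430, "Tumbleweed": 268}


def shrunk_limit(used_gib: float, step: int = 5) -> int:
    """Round the current utilization up to the next multiple of `step`."""
    return math.ceil(used_gib / step) * step


new_limits = {group: shrunk_limit(used) for group, used in utilization_gib.items()}
print(new_limits)                # {'SLE15': 815, 'SLE12-SP5': 430, 'Tumbleweed': 270}
print(sum(new_limits.values()))  # 1515
```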

Actions #5

Updated by mkittler over 3 years ago

  • Status changed from Workable to In Progress

We're back at 92 %. This time I not only reduced the limits of the big groups but also reduced the smaller ones and untracked assets. Additionally, I've now shrunk all groups to fit their current utilization (with a small margin) so they will hopefully not grow unattended anymore. However, judging by my database queries we're still over-allocating a lot. Not sure where my accounting is wrong.

Actions #6

Updated by mkittler over 3 years ago

  • Assignee deleted (mkittler)
  • Priority changed from Urgent to High

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Now we're even back at 90 %. That's good but I don't know how long that will remain the case. Not sure how to avoid the mentioned over-allocation (maybe my accounting is incorrect).

Actions #7

Updated by nicksinger over 3 years ago

mkittler wrote:

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Thanks, I think this already helps a lot to bring the severity down.

mkittler wrote:

Now we're even back at 90 %. That's good but I don't know how long it will be the case. Not sure how to avoid the mentioned over allocation (maybe my accounting is incorrect).

IMHO the over-allocation is a separate issue to look into. For me this ticket is mainly about our alert and how it is structured. I've seen some talk in the past about this topic (unfortunately I lost the source) where predictions were used for such cases to rephrase the alert from "File system is full soon" to "If you don't act, the FS will be full in the next 24h", and I think this is really what we're interested in. However, I currently fail to understand the predictions provided by influxdb, so I can't propose a better solution :(
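
To illustrate the prediction idea (independent of how influxdb would implement it natively), here is a hedged sketch that linearly extrapolates recent usage samples to estimate when the filesystem would be full; the sample data is made up and would in practice come from the same series the panel queries.

```python
from datetime import datetime

# Made-up usage samples (timestamp, used percent); in practice these would be
# fetched from the InfluxDB series behind the /assets panel.
samples = [
    (datetime(2020, 9, 2, 10, 0), 92.1),
    (datetime(2020, 9, 2, 12, 0), 93.0),
    (datetime(2020, 9, 2, 14, 0), 94.2),
]


def hours_until_full(samples, full_percent=100.0):
    """Linear extrapolation; returns None if usage is flat or shrinking."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    rate = (p1 - p0) / ((t1 - t0).total_seconds() / 3600)  # percent per hour
    if rate <= 0:
        return None
    return (full_percent - p1) / rate


eta = hours_until_full(samples)
if eta is not None and eta < 24:
    print(f"If nothing changes, /assets will be full in about {eta:.0f} h")
```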

Actions #8

Updated by okurz over 3 years ago

  • Copied to action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only added
Actions #9

Updated by okurz over 3 years ago

  • Status changed from In Progress to Workable
Actions #10

Updated by mkittler over 3 years ago

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

Actions #11

Updated by mkittler over 3 years ago

  • Priority changed from High to Normal
Actions #12

Updated by okurz over 3 years ago

mkittler wrote:

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

Yes, this ticket is not about /assets having space but about preventing the flakiness. I wonder, isn't that also linked to how often we run the gru cleanup jobs? So somehow we need to choose the time values in grafana with that knowledge in mind.
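
One way to encode that cleanup cadence into the alert is to keep the alert's pending ("for") duration longer than the interval between gru cleanup runs, so a peak that the next cleanup would resolve never fires. A tiny sketch of that relation; both durations below are hypothetical, not the actual OSD configuration.

```python
from datetime import timedelta

cleanup_interval = timedelta(hours=2)    # assumed cadence of the gru cleanup job
alert_for_duration = timedelta(hours=3)  # assumed grafana pending ("for") period

if alert_for_duration <= cleanup_interval:
    print("The alert can fire on peaks the next cleanup run would have resolved")
else:
    print("A cleanup run always gets a chance before the alert fires")
```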

Actions #13

Updated by okurz over 3 years ago

  • Priority changed from Normal to Low
Actions #14

Updated by okurz almost 3 years ago

  • Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Actions #15

Updated by okurz almost 3 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

I researched a bit what is implemented, possible, or not implemented in grafana. I found the open and rather recent feature request https://github.com/grafana/grafana/issues/30119 which also mentions other features, e.g. the "debouncing" that is already included in grafana: "alert if the 1m avg is above X for 5m". Beyond this, not much more seems possible with grafana. So the best we can do, short of considering another alerting solution, is to keep enough asset space free before alerting. As we now have "space-aware asset cleanup" in the meantime, we can likely treat even a single trigger of the alert as a sign that the space-aware asset cleanup is not working as expected and an urgent remedy is needed, regardless of whether the alert is flaky at this point or not.
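
For reference, the "debouncing" quoted above avoids flapping by requiring the condition to hold for a sustained period before firing. A minimal sketch of that behaviour outside grafana, fed with one 1-minute average per call; the 90 % value matches the alert threshold of this ticket, the rest is hypothetical.

```python
from collections import deque

THRESHOLD = 90.0     # percent, the alert threshold
PENDING_MINUTES = 5  # the condition must hold this long before firing

recent_breaches = deque(maxlen=PENDING_MINUTES)


def on_minute_average(avg_percent: float) -> bool:
    """Feed one 1-minute average; return True once the alert should fire."""
    recent_breaches.append(avg_percent > THRESHOLD)
    return len(recent_breaches) == PENDING_MINUTES and all(recent_breaches)


# a single dip below the threshold delays firing by another full window
for minute, value in enumerate([91, 92, 89, 91, 92, 93, 94, 95]):
    print(minute, value, on_minute_average(value))  # fires only at minute 7
```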

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1&from=now-1y&to=now shows that with the space-aware cleanup we have been safely below the alert threshold of 90% for multiple months and can tweak the limit to go a little closer to the red line without touching it. A small peak during that period touched 84%, so I guess we can move our limit from 80 to 84 to use some more of the available space for assets.

Actions #16

Updated by okurz almost 3 years ago

  • Due date set to 2021-07-23
  • Status changed from In Progress to Feedback
Actions #17

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

Merged yesterday and the deployment succeeded; no alerts right now, as expected. We should be good.
