action #70885
closed
[osd][alert] flaky file system alert: /assets
Added by okurz over 4 years ago.
Updated over 3 years ago.
Description
Observation
received alert email 2020-09-02 14:27Z
[Alerting] File systems alert
One of the file systems is too full
Metric name: /assets: Used Percentage
Value: 94.207
30m later the status switched back to "OK" but I guess we can easily hit the limit again.
The panel can be found at
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1
Problem
The alert is flaky as it went back to "ok" without explicit user action.
Suggestions
- Make sure some assets are cleaned up as we cannot keep that many; 4.7 TB for assets is too much.
- Research whether a better hysteresis can be implemented in grafana, e.g. the alert would trigger once 94% is reached but only recover once usage drops below 92% (see the sketch after this list)
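A minimal sketch of the hysteresis behaviour meant in the second suggestion, assuming the 94 %/92 % thresholds from above. Grafana does not offer this out of the box, so this only illustrates the intended state machine, not something that can be dropped into the dashboard:

```python
class HysteresisAlert:
    """Illustration of the proposed hysteresis: fire at/above the trigger
    threshold, clear only once usage drops below the lower recovery
    threshold. Thresholds are the ones suggested above (94 %/92 %)."""

    def __init__(self, trigger_pct=94.0, recover_pct=92.0):
        self.trigger_pct = trigger_pct
        self.recover_pct = recover_pct
        self.alerting = False

    def update(self, used_pct):
        """Feed one usage sample (in percent) and return the alert state."""
        if not self.alerting and used_pct >= self.trigger_pct:
            self.alerting = True      # trigger once 94 % is reached
        elif self.alerting and used_pct < self.recover_pct:
            self.alerting = False     # recover only below 92 %
        return self.alerting


alert = HysteresisAlert()
for sample in (90.0, 94.2, 93.5, 92.4, 91.8):
    print(sample, alert.update(sample))
# 94.2 fires the alert, 93.5 and 92.4 keep it firing, 91.8 clears it
```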
Further notes
I did not pause the alert as it is currently "ok" and we need to be careful that the available disk space is not completely depleted.
94% usage on a filesystem is already a lot. We must not increase the alert threshold any further.
Since you mentioned in the chat that the growing size of fixed assets might be a problem: They are indeed 1394 GiB which is about 28 % of the total asset size. I've mentioned it on #testing. I'll wait a little bit for a response. Otherwise I'd just reduce the quotas for assets covered by the cleanup for now.
I reduced the quotas so the usage should drop below 92 % on the next cleanup. That's likely required anyway because fixed assets are also accounted to "full" groups like SLE15, so deleting those fixed assets would likely just lead to other assets being kept longer without freeing further space. (I can't retrigger a new cleanup task because one is currently running.)
- Priority changed from High to Urgent
We are back to 94% and this needs action.
I came up with the following proposal in #testing which would hopefully help with the general "flakiness":
We're again at a critical level of disk usage of the assets partition on OSD. I wonder whether it would make sense to stop "overallocating". With that I mean: if you sum up the limits of all groups you end up with a number which is higher than the total disk space we have. It only works because some groups are not actually utilizing their limit. The obvious disadvantage of this approach is that I need to figure out which limits to reduce over and over again.
Wouldn't it make more sense if I shrank all group limits to fit the current utilization (maybe rounding up to the next number divisible by 5)? That alone wouldn't free any disk space, but then I could free disk space more easily by slightly decreasing all groups a little bit. And I wouldn't have to annoy you with the problem anymore because, unless someone increases the limits again, the automatic cleanup should ensure the partition isn't getting completely full.
I'm waiting a little bit for feedback but unless a good objection comes up soon I'm going to go for it as "this needs action".
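A rough sketch of the quota-shrinking idea from the proposal above. The per-group usage numbers are made up for illustration; in reality they would come from the openQA database. Only the "round up to the next number divisible by 5" rule is taken from the proposal:

```python
import math

# Hypothetical current asset usage per job group in GiB; real numbers would
# come from a database query, not from this sketch.
current_usage_gib = {
    "SLE 15": 873,
    "openSUSE Tumbleweed": 451,
    "Staging": 118,
}

def shrunk_limit(used_gib, step=5):
    """Shrink a group's limit to its current utilization, rounded up to the
    next multiple of `step` GiB as described in the proposal."""
    return math.ceil(used_gib / step) * step

new_limits = {group: shrunk_limit(used) for group, used in current_usage_gib.items()}
print(new_limits)                # {'SLE 15': 875, 'openSUSE Tumbleweed': 455, 'Staging': 120}
print(sum(new_limits.values()))  # total allocation after shrinking: 1450
```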
- Status changed from Workable to In Progress
We're back to 92 %. This time I not only reduced the limits of the big groups but also the smaller ones and untracked assets. Additionally, I've now shrunk all groups to fit their current utilization (with a small margin) so they will hopefully no longer grow unattended. However, considering my database queries we're still over-allocating a lot; not sure where my accounting is wrong.
- Assignee deleted (mkittler)
- Priority changed from Urgent to High
mkittler wrote:
Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358
Thanks, I think this already helps a lot to take the severity down.
mkittler wrote:
Now we're even back at 90 %. That's good, but I don't know how long that will be the case. Not sure how to avoid the mentioned over-allocation (maybe my accounting is incorrect).
IMHO the over-allocation is a separate issue to look into. For me this ticket is mainly about our alert and how it is structured. I've seen some talk in the past about this topic (unfortunately I lost the source) where predictions were used for such cases to rephrase the alert from "File system is full soon" to "If you don't act, the FS will be full within the next 24h", and I think that is really what we're interested in. However, I currently fail to understand the predictions provided by influxdb, so I can't propose a better solution :(
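To make the "full within the next 24h" idea more concrete, here is a hedged sketch using a plain linear extrapolation over two usage samples. It does not use InfluxDB's prediction functions (which the comment above says are unclear); the sample data and the 24 h horizon are assumptions for illustration:

```python
def hours_until_full(samples, capacity_pct=100.0):
    """samples: list of (hours, used_pct) pairs ordered by time.
    Returns the estimated number of hours until the filesystem is full,
    based on a straight line through the first and last sample, or None
    if usage is not growing."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    growth_per_hour = (u1 - u0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return (capacity_pct - u1) / growth_per_hour


# invented example: usage grew from 90 % to 94 % over the last 12 hours
samples = [(0, 90.0), (12, 94.0)]
eta = hours_until_full(samples)
if eta is not None and eta < 24:
    print(f"alert: /assets predicted to be full in about {eta:.0f} h")
```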
- Copied to action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only added
- Status changed from In Progress to Workable
We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.
- Priority changed from High to Normal
mkittler wrote:
We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.
Yes, this ticket is not about /assets having space but about preventing the flakiness. I wonder, isn't that also linked to how often we trigger the gru cleanup jobs? So somehow we need to design the time values in grafana with that knowledge in mind.
- Priority changed from Normal to Low
- Status changed from Workable to New
Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
I researched a bit what is implemented/possible/not implemented in grafana. I found the open and rather recent feature request https://github.com/grafana/grafana/issues/30119 which also mentions other features, e.g. the "debouncing" that is already included in grafana: "alert if the 1m avg is above X for 5m". Beyond this, not much more seems to be possible with grafana. So the best we can do, short of considering another alerting solution, is to keep enough asset space free before alerting. As we now have "space-aware asset cleanup" in the meantime, we can likely live with the rule that if the alert triggers even once, the space-aware asset cleanup is not working as expected and an urgent remedy is needed, regardless of whether the alert is flaky at that point or not.
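For comparison, a small sketch of the kind of "debouncing" mentioned above ("alert if 1m avg is above X for 5m"): the alert only fires once the condition has held for a whole window of consecutive samples. Threshold, window size and sample values are made up:

```python
from collections import deque

class DebouncedAlert:
    """Fire only once `window` consecutive samples exceed the threshold,
    roughly mimicking an "above X for 5m" style condition."""

    def __init__(self, threshold_pct=90.0, window=5):
        self.threshold_pct = threshold_pct
        self.recent = deque(maxlen=window)

    def update(self, used_pct):
        self.recent.append(used_pct)
        return (len(self.recent) == self.recent.maxlen
                and all(v > self.threshold_pct for v in self.recent))


alert = DebouncedAlert()
for sample in (91, 92, 89, 91, 92, 93, 94, 95):
    print(sample, alert.update(sample))
# the dip to 89 keeps the alert from firing until it has aged out of the
# 5-sample window; the alert fires only on the last sample
```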
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1&from=now-1y&to=now shows that with the space-aware cleanup we have been safely below the alert threshold of 90% for multiple months now and can tweak the limit to get a little closer to the red line without touching it. A small peak during that period touched 84%, so I guess we can raise our limit from 80 to 84 to use some more of the available space for assets.
- Due date set to 2021-07-23
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
Merged yesterday and the deployment succeeded; no alerts right now, as expected. We should be good.