action #70885

closed

[osd][alert] flaky file system alert: /assets

Added by okurz over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2020-09-02
Due date:
2021-07-23
% Done:

0%

Estimated time:
Tags:

Description

Observation

received alert email 2020-09-02 14:27Z

[Alerting] File systems alert

One of the file systems is too full

Metric name: /assets: Used Percentage
Value: 94.207

30m later the status switched back to "OK" but I guess we can easily hit the limit again.

The panel can be found at
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1

Problem

The alert is flaky as it went back to "ok" without explicit user action.

Suggestions

  • Make sure some assets are cleaned up; we cannot keep that many, and 4.7 TB for assets is too much.
  • Research whether a better hysteresis can be implemented in grafana, e.g. the alert would trigger once 94% is reached but only recover once usage drops below 92% (see the sketch after this list).
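
The second suggestion describes a classic hysteresis band. Below is a minimal sketch of that behaviour, assuming an external check script would evaluate it (grafana itself did not offer hysteresis at the time); the thresholds are the ones named above, the function name is hypothetical.

```python
TRIGGER_PERCENT = 94.0  # alert fires at or above this usage
RECOVER_PERCENT = 92.0  # alert only clears again below this usage


def evaluate(usage_percent: float, alerting: bool) -> bool:
    """Return the new alert state given the current usage and the previous state."""
    if not alerting and usage_percent >= TRIGGER_PERCENT:
        return True   # crossed the upper threshold: fire
    if alerting and usage_percent < RECOVER_PERCENT:
        return False  # recover only once usage is clearly below the trigger
    return alerting   # inside the band: keep the previous state


# 93 % keeps an already-firing alert active instead of flapping back to OK
print(evaluate(94.3, alerting=False))  # True
print(evaluate(93.0, alerting=True))   # True
print(evaluate(91.8, alerting=True))   # False
```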

Further notes

I did not pause the alert as it is currently "ok" and we need to be careful that the available disk space is not completely depleted.

94% usage on a filesystem is already a lot. We must not increase the alert threshold any further.


Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only (Resolved, mkittler, 2020-09-02)

Actions #1

Updated by mkittler over 3 years ago

Since you mentioned in the chat that the growing size of fixed assets might be a problem: they are indeed 1394 GiB, which is about 28 % of the total asset size. I've mentioned it on #testing and will wait a little bit for a response. Otherwise I'd just reduce the quotas for assets covered by the cleanup for now.

Actions #2

Updated by mkittler over 3 years ago

I reduced the quotas so the usage should drop below 92 % on the next cleanup. That's likely required anyway because fixed assets are also accounted to "full" groups like SLE15, so deleting those fixed assets would likely just lead to other assets being kept longer rather than freeing further space. (I can't retrigger a new cleanup task because one is currently running.)

Actions #3

Updated by okurz over 3 years ago

  • Priority changed from High to Urgent

We are back at 94% and this needs action.

Actions #4

Updated by mkittler over 3 years ago

  • Assignee set to mkittler

I came up with the following proposal on #testing which would hopefully help with the general "flakiness":

We're again at a critical level of disk usage on the assets partition on OSD. I wonder whether it would make sense to stop "overallocating". With that I mean: if you sum up the limits for all groups you'll end up with a number which is higher than the total disk space we have. It only works because some groups are not actually utilizing their limit. The obvious disadvantage of this approach is that I need to figure out which limits to reduce over and over again.
Wouldn't it make more sense if I shrank all group limits to fit the current utilization (maybe rounding up to the next number divisible by 5)? That alone wouldn't free any disk space, but then I could free disk space more easily by slightly decreasing all groups a little bit. And I wouldn't have to annoy you with the problem anymore, because unless someone increases the limits again the automatic cleanup should ensure the partition isn't getting completely full.

I'm waiting a little bit for feedback, but unless a good objection comes up soon I'm going to go for it as "this needs action".
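
A rough sketch of the shrinking proposal, assuming the per-group utilization in GiB is already known; the group names and numbers below are made up, only the "round up to the next multiple of 5" rule comes from the proposal above.

```python
import math

# Hypothetical per-group utilization in GiB (numbers made up). The proposal:
# set each group's limit to its current utilization, rounded up to the next
# multiple of 5, so the sum of the limits no longer exceeds the partition size.
utilization_gib = {"SLE15": 812, "SLE12-SP5": 430, "Tumbleweed": 268}


def shrunk_limit(used_gib: float, step: int = 5) -> int:
    """Round the current utilization up to the next multiple of `step`."""
    return math.ceil(used_gib / step) * step


new_limits = {group: shrunk_limit(used) for group, used in utilization_gib.items()}
print(new_limits)                # {'SLE15': 815, 'SLE12-SP5': 430, 'Tumbleweed': 270}
print(sum(new_limits.values()))  # 1515
```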

Actions #5

Updated by mkittler over 3 years ago

  • Status changed from Workable to In Progress

We're back at 92 %. This time I not only reduced the limits of the big groups but also reduced the smaller ones and untracked assets. Additionally, I've now shrunk all groups to fit their current utilization (with a small margin) so they will hopefully not grow unattended anymore. However, judging by my database queries we're still over-allocating a lot. Not sure where my accounting is wrong.

Actions #6

Updated by mkittler over 3 years ago

  • Assignee deleted (mkittler)
  • Priority changed from Urgent to High

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Now we're even back at 90 %. That's good but I don't know how long that will remain the case. Not sure how to avoid the mentioned over-allocation (maybe my accounting is incorrect).

Actions #7

Updated by nicksinger over 3 years ago

mkittler wrote:

Created SR to make the change for untracked assets persistent: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/358

Thanks, I think this already helps a lot to bring the severity down.

mkittler wrote:

Now we're even back at 90 %. That's good but I don't know how long it will be the case. Not sure how to avoid the mentioned over allocation (maybe my accounting is incorrect).

IMHO the over-allocation is a separate issue to look into. For me this ticket is mainly about our alert and how it is structured. I've seen some talk in the past about this topic (unfortunately I lost the source) where predictions were used for such cases to rephrase the alert from "File system is full soon" to "If you don't act, the FS will be full in the next 24h", and I think this is really what we're interested in. However, I currently fail to understand the predictions provided by influxdb, so I can't propose a better solution :(
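
To illustrate the prediction idea (independent of how influxdb would implement it natively), here is a hedged sketch that linearly extrapolates recent usage samples to estimate when the filesystem would be full; the sample data is made up and would in practice come from the same series the panel queries.

```python
from datetime import datetime

# Made-up usage samples (timestamp, used percent); in practice these would be
# fetched from the InfluxDB series behind the /assets panel.
samples = [
    (datetime(2020, 9, 2, 10, 0), 92.1),
    (datetime(2020, 9, 2, 12, 0), 93.0),
    (datetime(2020, 9, 2, 14, 0), 94.2),
]


def hours_until_full(samples, full_percent=100.0):
    """Linear extrapolation; returns None if usage is flat or shrinking."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    rate = (p1 - p0) / ((t1 - t0).total_seconds() / 3600)  # percent per hour
    if rate <= 0:
        return None
    return (full_percent - p1) / rate


eta = hours_until_full(samples)
if eta is not None and eta < 24:
    print(f"If nothing changes, /assets will be full in about {eta:.0f} h")
```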

Actions #8

Updated by okurz over 3 years ago

  • Copied to action #71575: [osd][alert] limited /assets - idea: ask EngInfra for slow+cheap storage from central server for /assets/fixed only added
Actions #9

Updated by okurz over 3 years ago

  • Status changed from In Progress to Workable
Actions #10

Updated by mkittler over 3 years ago

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

Actions #11

Updated by mkittler over 3 years ago

  • Priority changed from High to Normal
Actions #12

Updated by okurz over 3 years ago

mkittler wrote:

We're now at 63 % thanks to the new storage. Not sure if that's enough to consider the ticket resolved because technically we'll end up in the same situation again and might want to follow the 2nd suggestion.

Yes, this ticket is not about /assets having space but about preventing the flakiness. I wonder, isn't that also linked to how often we run the gru cleanup jobs? So somehow we need to choose the time values in grafana with that knowledge in mind.
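
One way to encode that cleanup cadence into the alert is to keep the alert's pending ("for") duration longer than the interval between gru cleanup runs, so a peak that the next cleanup would resolve never fires. A tiny sketch of that relation; both durations below are hypothetical, not the actual OSD configuration.

```python
from datetime import timedelta

cleanup_interval = timedelta(hours=2)    # assumed cadence of the gru cleanup job
alert_for_duration = timedelta(hours=3)  # assumed grafana pending ("for") period

if alert_for_duration <= cleanup_interval:
    print("The alert can fire on peaks the next cleanup run would have resolved")
else:
    print("A cleanup run always gets a chance before the alert fires")
```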

Actions #13

Updated by okurz over 3 years ago

  • Priority changed from Normal to Low
Actions #14

Updated by okurz almost 3 years ago

  • Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

Actions #15

Updated by okurz almost 3 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

I researched a bit what is implemented, possible, or not implemented in grafana. I found the open and rather recent feature request https://github.com/grafana/grafana/issues/30119 which also mentions other features, e.g. the "debouncing" that is already included in grafana: "alert if the 1m avg is above X for 5m". Beyond this, not much more seems possible with grafana. So the best we can do, short of considering another alerting solution, is to keep enough asset space free before alerting. As we now have "space-aware asset cleanup" in the meantime, we can likely treat even a single trigger of the alert as a sign that the space-aware asset cleanup is not working as expected and an urgent remedy is needed, regardless of whether the alert is flaky at this point or not.
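
For reference, the "debouncing" quoted above avoids flapping by requiring the condition to hold for a sustained period before firing. A minimal sketch of that behaviour outside grafana, fed with one 1-minute average per call; the 90 % value matches the alert threshold of this ticket, the rest is hypothetical.

```python
from collections import deque

THRESHOLD = 90.0     # percent, the alert threshold
PENDING_MINUTES = 5  # the condition must hold this long before firing

recent_breaches = deque(maxlen=PENDING_MINUTES)


def on_minute_average(avg_percent: float) -> bool:
    """Feed one 1-minute average; return True once the alert should fire."""
    recent_breaches.append(avg_percent > THRESHOLD)
    return len(recent_breaches) == PENDING_MINUTES and all(recent_breaches)


# a single dip below the threshold delays firing by another full window
for minute, value in enumerate([91, 92, 89, 91, 92, 93, 94, 95]):
    print(minute, value, on_minute_average(value))  # fires only at minute 7
```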

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1&from=now-1y&to=now shows that with the space-aware cleanup we have been safely below the alert threshold of 90% for multiple months and can tweak the limit to go a little closer to the red line without touching it. A small peak during that period touched 84%, so I guess we can move our limit from 80 to 84 to use some more of the available space for assets.

Actions #16

Updated by okurz almost 3 years ago

  • Due date set to 2021-07-23
  • Status changed from In Progress to Feedback
Actions #17

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

Merged yesterday and the deployment succeeded; no alerts right now, as expected. We should be good.
