action #70834
closed
[alert] Refine I/O time alerts for OSD
Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2020-09-02
Due date:
% Done:
0%
Estimated time:
Tags:
Description
We have several IO time alerts for OSD itself:
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=46&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=47&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=48&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=57&fullscreen&edit&tab=alert
They need to be reworked so that:
- The right disk is shown for the right purpose (e.g. /dev/vde is not /results any longer)
- https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/telegraf-webui.conf#L32 might need adjustments to store persistent identifiers like UUIDs
- The panel itself could perhaps be generated from info in salt (mountpoint): https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/webui.dashboard.json#L5769
- DONE: The alert thresholds need to be adjusted to not trigger that often
  - Spikes of up to 7s seem to happen from time to time
  - The situation gets critical if these spikes continue for several minutes
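To illustrate the point about showing the right disk for the right purpose: a minimal sketch of resolving a mountpoint to its current device from text in the /proc/mounts format, so a panel could be keyed on the mountpoint rather than a hard-coded device name. The function name `device_for_mountpoint` and the sample device/mountpoint pairs are hypothetical and do not reflect the actual OSD layout.

```python
from typing import Optional


def device_for_mountpoint(mounts_text: str, mountpoint: str) -> Optional[str]:
    """Return the device mounted at `mountpoint`, or None if nothing is mounted there.

    `mounts_text` is expected in the /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[1] == mountpoint:
            # A later entry for the same mountpoint shadows an earlier one.
            device = fields[0]
    return device


# Hypothetical sample data, NOT the real OSD disk layout.
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
/dev/vdd /results xfs rw,noatime 0 0
/dev/vde /space-slow xfs rw,noatime 0 0
"""

if __name__ == "__main__":
    print(device_for_mountpoint(SAMPLE_MOUNTS, "/results"))
```

On a live host the same function could be fed the contents of /proc/mounts. Whether the result is then used to template the dashboard from salt is left open here.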
All of the alerts linked above are currently paused since they provide little benefit while being that flaky.
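The idea of tolerating short spikes while still catching sustained ones could be sketched as follows. This is only an illustration of the logic, not the actual Grafana alert definition: the function name `sustained_breach`, the 5s threshold, and the one-sample-per-minute interval are assumptions.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold_s: float,
                     min_consecutive: int) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed `threshold_s`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold_s else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Assumed: one I/O time sample per minute, values in seconds.
    single_spike = [0.5, 7.0, 0.4, 0.6, 0.5]
    sustained = [0.5, 7.0, 6.5, 7.2, 0.4]
    print(sustained_breach(single_spike, threshold_s=5.0, min_consecutive=3))
    print(sustained_breach(sustained, threshold_s=5.0, min_consecutive=3))
```

With these assumed values an isolated 7s spike does not trigger, while three consecutive minutes above the threshold do.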