action #70834
closed
[alert] Refine I/O time alerts for OSD
Start date:
2020-09-02
Due date:
% Done:
0%
Estimated time:
Tags:
Description
We have several I/O time alerts for OSD itself:
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=46&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=47&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=48&fullscreen&edit&tab=alert
- https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=57&fullscreen&edit&tab=alert
They need to be reworked so that:
- The right disk is shown for the right purpose (e.g. /dev/vde is not /results any longer)
- https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/telegraf-webui.conf#L32 might need adjustments to store persistent identifiers like UUIDs
- The panel itself could maybe be generated from info in salt (mountpoint): https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/webui.dashboard.json#L5769
- DONE: The alert thresholds need to be adjusted to not trigger that often
  - Spikes of up to 7s seem to happen from time to time
  - The situation gets critical if these spikes continue for several minutes
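To illustrate the first point (the right disk for the right purpose), here is a minimal sketch that looks up the device behind a mountpoint instead of hard-coding a name like /dev/vde. It assumes input in /proc/mounts format. The function name `device_for_mountpoint` and the sample layout are hypothetical and do not reflect the real OSD setup or the existing salt code.

```python
def device_for_mountpoint(mounts_text, mountpoint):
    """Return the device mounted at `mountpoint`, or None if not found.

    `mounts_text` is expected in /proc/mounts format, i.e. whitespace
    separated fields: device, mountpoint, fstype, options, dump, pass.
    """
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        if fields[1] == mountpoint:
            # A later entry shadows an earlier one for the same mountpoint.
            device = fields[0]
    return device


# Hypothetical sample data, NOT the real OSD partition layout.
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
/dev/vdd /results xfs rw,noatime 0 0
/dev/vde /srv xfs rw,noatime 0 0
"""

print(device_for_mountpoint(SAMPLE_MOUNTS, "/results"))
```

On a real host the input could come from reading /proc/mounts, so a panel or alert keyed on the mountpoint would follow the device if the partitions get reordered.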
All alerts linked above are paused right now since, being that flaky, they don't provide much benefit.
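The threshold idea from the list above (tolerate single spikes, alert only if they continue for several minutes) can be sketched in plain Python. This is only an illustration of the logic, not the actual Grafana alert configuration. The function name `sustained_breach`, the 5s threshold, and the 180s duration are assumptions chosen for the example.

```python
def sustained_breach(samples, threshold, min_duration):
    """Return True if values stay above `threshold` for at least
    `min_duration` seconds without dropping back in between.

    `samples` is an iterable of (timestamp_seconds, value) pairs,
    sorted by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= min_duration:
                return True
        else:
            breach_start = None  # value recovered, reset the window
    return False


# Hypothetical I/O time samples (seconds), one per minute.
single_spike = [(0, 0.2), (60, 7.0), (120, 0.3), (180, 0.2)]
persistent = [(0, 0.2), (60, 7.0), (120, 6.5), (180, 7.2), (240, 6.8)]

# Assumed values for illustration: 5s threshold, 3 minutes duration.
print(sustained_breach(single_spike, threshold=5.0, min_duration=180))
print(sustained_breach(persistent, threshold=5.0, min_duration=180))
```

With these assumed values the single 7s spike is ignored, while the run of high values from minute 1 to minute 4 counts as a breach.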
Updated by okurz about 4 years ago
- Related to action #69667: missing monitoring data for vde after partitions where reordered added
Updated by okurz about 4 years ago
- Tags set to alert
- Target version set to Ready
Updated by okurz about 4 years ago
- Related to action #73165: [osd] Consolidate "expensive+fast" and "cheap+slow" storage after realizing vdc is "cheap+slow" as well added
Updated by okurz about 4 years ago
- Status changed from New to Feedback
- Assignee set to okurz
From what I learned in #73165 I can update the current monitoring and alerting in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/375 . I merged it, but it seems this no longer triggered a CI pipeline in master, so I triggered that manually now.
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Priority changed from Normal to Low
Crossed off the point I have done. The rest is left to be done.
Updated by okurz about 4 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
- Priority changed from Low to Normal
Hm, given that the current state is OK again and we change the partition layout so seldom, I think it is fine as it is. Of course, if someone has a cool idea we can rework our salt code. I have recorded that now in #65271.
Updated by okurz over 2 years ago
- Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added