action #70834
Updated by okurz about 4 years ago
We have several I/O time alerts for OSD itself:

* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=46&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=47&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=48&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=57&fullscreen&edit&tab=alert

They need to be reworked so that:

1. The right disk is shown for the right purpose (e.g. /dev/vde is no longer /results)
   * https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/telegraf-webui.conf#L32 might need adjustments to store persistent identifiers such as UUIDs
   * The panel itself could perhaps be generated from info in salt (mountpoint): https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/webui.dashboard.json#L5769
2. DONE: ~~The alert thresholds need to be adjusted to not trigger that often~~
   * ~~Spikes of up to 7s seem to happen from time to time~~
   * ~~The situation gets critical if these spikes continue for several minutes~~

~~All above linked alerts are on pause right now since they don't provide a big benefit being that flaky.~~
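
For item 1, a possible starting point is to resolve each mountpoint to its current device and filesystem UUID, so that panels can be keyed on something more stable than names like /dev/vde. The following is only a minimal sketch, assuming a Linux host where udev populates /dev/disk/by-uuid. The function names are hypothetical and are not part of salt-states-openqa or telegraf, and the list of mountpoints is an assumption apart from /results.

```python
import os
from typing import Dict, Optional


def device_for_mountpoint(mounts_text: str, mountpoint: str) -> Optional[str]:
    """Return the device mounted at `mountpoint`, given /proc/mounts-style text."""
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device, field 1 the mountpoint.
        if len(fields) >= 2 and fields[1] == mountpoint:
            device = fields[0]  # a later entry shadows an earlier one
    return device


def uuid_map(by_uuid_dir: str = "/dev/disk/by-uuid") -> Dict[str, str]:
    """Map resolved device paths to filesystem UUIDs using the udev symlinks."""
    mapping: Dict[str, str] = {}
    if not os.path.isdir(by_uuid_dir):
        return mapping
    for uuid in os.listdir(by_uuid_dir):
        target = os.path.realpath(os.path.join(by_uuid_dir, uuid))
        mapping[target] = uuid
    return mapping


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as fh:
        mounts = fh.read()
    uuids = uuid_map()
    # Hypothetical list of mountpoints; only /results is named in this ticket.
    for mp in ("/results",):
        dev = device_for_mountpoint(mounts, mp)
        resolved = os.path.realpath(dev) if dev else None
        print(mp, dev, uuids.get(resolved) if resolved else None)
```

The output of such a lookup could be fed into salt as a grain or pillar value and used when templating the telegraf config and the dashboard JSON, but how that would be wired up is left open here.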