action #70834

Updated by okurz about 1 year ago

We have several IO time alerts for OSD itself:
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=46&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=47&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=48&fullscreen&edit&tab=alert
* https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=57&fullscreen&edit&tab=alert

They need to be reworked so that:
1. The right disk is shown for the right purpose (e.g. /dev/vde is no longer /results)
* https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/telegraf-webui.conf#L32 might need adjusting to store a persistent identifier such as a UUID
* The panel itself could perhaps be generated from info in salt (mountpoint): https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/webui.dashboard.json#L5769
2. DONE: ~~The alert thresholds need to be adjusted to not trigger that often~~
* ~~Spikes of up to 7s seem to happen from time to time~~
* ~~The situation gets critical if these spikes continue for several minutes~~

~~All above linked alerts are on pause right now since they don't provide a big benefit being that flaky.~~
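
For point 1, a minimal sketch of how the device currently behind a mountpoint, and its persistent UUID, could be looked up instead of hard-coding /dev/vdX. This is only an illustration, not what salt-states-openqa does: the function names, the sample mount table and the device names in it are made up, and it assumes a Linux host with /proc/mounts and /dev/disk/by-uuid.

```python
import os


def device_for_mountpoint(mounts_text, mountpoint):
    """Return the device mounted at `mountpoint`, or None if nothing is.

    `mounts_text` is expected in /proc/mounts format:
    "<device> <mountpoint> <fstype> <options> <dump> <pass>" per line.
    """
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            # Keep going: a later entry for the same path shadows earlier ones.
            device = fields[0]
    return device


def uuid_for_device(device, by_uuid_dir="/dev/disk/by-uuid"):
    """Return the UUID whose symlink in `by_uuid_dir` points at `device`."""
    if not os.path.isdir(by_uuid_dir):
        return None
    target = os.path.realpath(device)
    for name in os.listdir(by_uuid_dir):
        if os.path.realpath(os.path.join(by_uuid_dir, name)) == target:
            return name
    return None


# Hypothetical mount table, for illustration only (not the real OSD layout).
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
/dev/vdd /results xfs rw,noatime 0 0
/dev/vde /srv/example xfs rw,noatime 0 0
"""

if __name__ == "__main__":
    dev = device_for_mountpoint(SAMPLE_MOUNTS, "/results")
    print(dev, uuid_for_device(dev) if dev else None)
```

On a real host the first argument would be the content of /proc/mounts, e.g. `device_for_mountpoint(open("/proc/mounts").read(), "/results")`. The result could then feed whatever the telegraf config or the dashboard template uses to identify the disk.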
