action #70834

Updated by okurz over 3 years ago

We have several IO time alerts for OSD itself: 
 * https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=46&fullscreen&edit&tab=alert 
 * https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=47&fullscreen&edit&tab=alert 
 * https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=48&fullscreen&edit&tab=alert 
 * https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?panelId=57&fullscreen&edit&tab=alert 

 They need to be reworked so that: 
 1. The right disk is shown for the right purpose (e.g. /dev/vde is not /results any longer) 
   * https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/telegraf-webui.conf#L32 might need adjustments to store persistent identifiers like UUIDs 
   * The panel itself could maybe be generated from info in salt (mountpoint): https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/monitoring/grafana/webui.dashboard.json#L5769 
 2. DONE: ~~The alert thresholds need to be adjusted to not trigger that often~~ 
   * ~~Spikes of up to 7s seem to happen from time to time~~ 
   * ~~The situation gets critical if these spikes continue for several minutes~~ 

 ~~All above linked alerts are on pause right now since they don't provide a big benefit being that flaky.~~
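
 For point 1, one possible way to avoid hard-coding device names is to look up the device behind each mount point when the panel is generated. The following is only a minimal sketch, assuming input in the `/proc/mounts` format; `parse_mounts`, `device_for` and the sample data (including `/dev/vdd`) are hypothetical and not taken from salt-states-openqa:

```python
# Hypothetical sample in /proc/mounts format; the device names are made up.
SAMPLE_MOUNTS = """\
proc /proc proc rw 0 0
/dev/vda1 / ext4 rw 0 0
/dev/vdd /results xfs rw 0 0
"""


def parse_mounts(text):
    """Map each mount point to its backing block device.

    Only entries whose source starts with /dev/ are kept, so pseudo
    filesystems such as proc or tmpfs are skipped. If a mount point
    appears more than once, the last entry wins.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith("/dev/"):
            continue
        device, mountpoint = fields[0], fields[1]
        mapping[mountpoint] = device
    return mapping


def device_for(mountpoint, mounts):
    """Return the short device name (e.g. 'vdd') for a mount point."""
    return mounts[mountpoint].rsplit("/", 1)[-1]


mounts = parse_mounts(SAMPLE_MOUNTS)
print(device_for("/results", mounts))
```

 On a real host the input would come from reading `/proc/mounts`. The resulting short device name could then be filled into the panel query or title instead of a fixed value like `vde`. This does not cover storing UUIDs, which would need a separate lookup.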
