action #158113

Updated by okurz about 1 month ago

## Motivation 
#158104 shows VNC typing issues. Precisely for this reason we added alerts in #150983 that fire on excessively high CPU load. https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694 clearly shows a load consistently in the range of 50-70(!) for mania, yet no alert was triggered. We should crosscheck https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert and make that alert stricter.

 ## Acceptance criteria 
 * **AC1:** CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned 
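AC1 leaves open how "consistently" is measured. Below is a minimal Python sketch of one possible interpretation. It assumes each evaluation window of load15 samples is reduced with a minimum, so a single dip below 40 prevents firing. The function name and the window-of-samples model are illustrative only, not the actual rule defined in `alerting-dashboard-WD.yaml.template`.

```python
from collections.abc import Sequence

# Threshold taken from AC1. "Consistently" is interpreted here (assumption)
# as "every load15 sample in the evaluation window is above the threshold".
LOAD15_THRESHOLD = 40.0


def load15_alert_should_fire(samples: Sequence[float],
                             threshold: float = LOAD15_THRESHOLD,
                             min_samples: int = 1) -> bool:
    """Return True if all load15 samples in the window exceed the threshold.

    Reducing with min() means one dip below the threshold keeps the alert
    from firing, which mirrors a "consistently above" condition. A mean()
    reducer would be more lenient towards short dips.
    """
    if len(samples) < min_samples:
        # Not enough data to judge; this sketch treats that as "no alert".
        return False
    return min(samples) > threshold


print(load15_alert_should_fire([52.0, 61.5, 70.2]))  # sustained high load
print(load15_alert_should_fire([52.0, 38.0, 70.2]))  # one dip below 40
```

Under this interpretation, a sustained load15 of 50-70 as described in the motivation satisfies the condition, so the rule is expected to fire.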

 ## Suggestions 
* Crosscheck the alert rule at https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert or its implementation in code at https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941
* We already have "red indicators" in the panels showing that the alert conditions are met, but we don't receive any notifications yet. We probably need to check the alert state history and the notification policies in detail
* Trigger an artificial alert and verify that we actually receive notifications
* Compare a working alert from the "Alert rules" overview with the broken "worker-arm1: CPU load alert" definition
* Check the "Notification policies" and what they need to match an alert (e.g. the `__contacts__ =~ .*"osd-admins".*` tag)
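For the last suggestion, it can help to check whether an alert's label value actually satisfies the policy's matcher. The following minimal Python sketch assumes Grafana evaluates `=~` matchers the way Prometheus Alertmanager does: the regular expression must match the entire label value, and a missing label behaves like an empty value. The label values below are hypothetical and not taken from the real alert rules. Under that assumption the quotes in the pattern matter, so a `__contacts__` value of `osd-admins` without quotes would not be routed by this policy.

```python
import re


def matches_regex_matcher(label_value: str, pattern: str) -> bool:
    """Evaluate a `label =~ pattern` matcher with full anchoring.

    Assumption: as with Prometheus Alertmanager matchers, the pattern has to
    match the whole label value, hence re.fullmatch instead of re.search.
    """
    return re.fullmatch(pattern, label_value) is not None


# Matcher from the notification policy mentioned in the ticket.
CONTACTS_PATTERN = r'.*"osd-admins".*'

# Hypothetical label values, for illustration only.
EXAMPLES = [
    '"osd-admins"',                  # quoted contact name
    '"osd-admins","other-contact"',  # several quoted contacts
    'osd-admins',                    # no quotes around the name
    '',                              # label missing or empty
]

for value in EXAMPLES:
    print(f"{value!r}: {matches_regex_matcher(value, CONTACTS_PATTERN)}")
```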