action #158113
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #158110: [epic] Prevent worker overload
typing issue on ppc64 worker - make CPU load alert more strict size:M
0%
Description
Motivation
#158104 shows VNC typing issues. To catch exactly this, we deliberately added alerts for excessive CPU load in #150983. https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694 clearly shows a load consistently in the range of 50-70(!) for mania, but no alert triggered. We should crosscheck https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert
and make that alert more strict.
Acceptance criteria
- AC1: CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned
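To make "consistently above 40" concrete, here is a minimal Python sketch of one possible reading: the alert should fire only if every one of the most recent load15 samples exceeds the threshold. The function name, the sample list and the window length of 3 are hypothetical and chosen only for illustration. The actual Grafana rule uses its own evaluation interval and pending period, which are not shown in this ticket.

```python
from typing import Sequence


def load_alert_should_fire(
    samples: Sequence[float],
    threshold: float = 40.0,   # threshold taken from AC1
    min_consecutive: int = 3,  # hypothetical window length, not from the real rule
) -> bool:
    """Return True if the last `min_consecutive` samples all exceed `threshold`."""
    if len(samples) < min_consecutive:
        # Not enough data to call the load "consistently" high.
        return False
    recent = samples[-min_consecutive:]
    return all(value > threshold for value in recent)


# Illustrative values only: a load in the 50-70 range as described for mania.
print(load_alert_should_fire([55.0, 62.0, 70.0]))
# A single dip below the threshold resets the "consistently" condition.
print(load_alert_should_fire([55.0, 30.0, 70.0]))
```

With this reading, a load that stays in the 50-70 range would be expected to fire, so the absence of an alert points at the rule definition or the notification path rather than at the threshold itself.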
Suggestions
- Crosscheck https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert or the implementation in code https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941
- We already have "red indicators" in the panels showing that the alert conditions are met, but we don't have notifications yet. We probably need to check the alert state history and notification policies in detail
- Trigger an artificial alert and verify that we actually receive notifications
- Compare a working alert from the "Alert rules"-overview with the broken "worker-arm1: CPU load alert" definition
- Check the "Notification policies" and what they need to match an alert (e.g. the __contacts__ =~ .*"osd-admins".* tag)
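When checking why a notification policy does not pick up an alert, it can help to reason about the matcher in isolation. The following Python sketch approximates how a Prometheus-style `=~` matcher behaves: the regular expression must match the whole label value, and a missing label is treated as an empty string. The helper name and the example label values are assumptions for illustration. The real labels should be taken from the alert instance in Grafana, and Grafana uses Go regular expressions, which behave the same as Python's for this simple pattern.

```python
import re
from typing import Mapping


def regex_matcher_matches(labels: Mapping[str, str], name: str, pattern: str) -> bool:
    """Approximate a `name =~ pattern` matcher: fully anchored regex on the label value."""
    value = labels.get(name, "")  # a missing label behaves like an empty value
    return re.fullmatch(pattern, value) is not None


PATTERN = r'.*"osd-admins".*'  # pattern quoted in the suggestion above

# Hypothetical label sets, not copied from a real alert instance.
with_contact = {"__contacts__": '"osd-admins"'}
other_contact = {"__contacts__": '"someone-else"'}
no_contact_label = {"alertname": "CPU load alert"}

print(regex_matcher_matches(with_contact, "__contacts__", PATTERN))      # True
print(regex_matcher_matches(other_contact, "__contacts__", PATTERN))     # False
print(regex_matcher_matches(no_contact_label, "__contacts__", PATTERN))  # False
```

If the CPU load alert lacks the label the policy matches on, as in the last case, the alert can be firing in the panel while no notification is ever routed.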