action #158113
openQA Project (public) - coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #158110: [epic] Prevent worker overload
typing issue on ppc64 worker - make CPU load alert more strict size:M
Description
Motivation
#158104 shows VNC typing issues. To catch this kind of problem we deliberately added alerts for excessive CPU load in #150983. https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694 clearly shows a load consistently in the range of 50-70(!) for mania, but no alert was triggered. We should crosscheck https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert and make that alert more strict.
Acceptance criteria
- AC1: CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned
Suggestions
- Crosscheck https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert or the implementation in code https://gitlab.suse.de/openqa/salt-states-openqa/-/blame/master/monitoring/grafana/alerting-dashboard-WD.yaml.template?ref_type=heads#L941
- We already have "red indicators" in the panels showing that the alert conditions are met, but we don't receive notifications yet. We probably need to check the alert state history and the notification policies in detail
- Trigger an artificial alert and verify that we actually receive notifications
- Compare a working alert from the "Alert rules"-overview with the broken "worker-arm1: CPU load alert" definition
- Check the "Notification policies" and what they need to match an alert (e.g. the __contacts__ =~ .*"osd-admins".* tag)
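The matcher in the last suggestion can be illustrated with a small sketch. It only mimics how a regex label matcher might be checked and is not Grafana's implementation: the helper name policy_matches and the example label values are hypothetical, and it assumes the regex has to match the whole label value.

```python
import re


def policy_matches(labels, label_name, pattern):
    """Return True if the label exists and its whole value matches the regex.

    Hypothetical helper for illustration; assumes anchored (full) matching.
    """
    value = labels.get(label_name)
    if value is None:
        return False
    return re.fullmatch(pattern, value) is not None


# Matcher taken from the suggestion above.
PATTERN = r'.*"osd-admins".*'

# Assumed example label sets, not copied from the real alert rules.
alert_with_contact = {"__contacts__": '"osd-admins"'}
alert_other_contact = {"__contacts__": '"someone-else"'}
alert_without_label = {"alertname": "CPU load alert"}

print(policy_matches(alert_with_contact, "__contacts__", PATTERN))
print(policy_matches(alert_other_contact, "__contacts__", PATTERN))
print(policy_matches(alert_without_label, "__contacts__", PATTERN))
```

Under these assumptions an alert that lacks the __contacts__ label, or carries a different contact, would never reach the policy, which is one possible reason for a firing alert without a notification.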
Updated by okurz 9 months ago
- Copied from action #158104: typing issue on ppc64 worker size:S added
Updated by okurz 9 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1133 was merged but I doubt it's effective. Help?
Updated by okurz 9 months ago · Edited
Got help: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1137 makes the alert evaluation ignore NaN. We will need a few more days to monitor the impact.
https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?viewPanel=54694&orgId=1&from=1711742324943&to=1711831309898 looks like a post-evaluation of how alerts would be triggered and shows that actually alerts would have triggered. Let's see if more systems actually trigger alerts in the upcoming days.
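Why NaN values can keep a threshold alert from firing can be shown with a minimal sketch. This is not the Grafana evaluation logic nor the content of the merge request; the function names, the sample values and the interpretation of "consistently above" as "every valid sample in the window is above the threshold" are assumptions made for illustration.

```python
import math


def naive_alert_fires(samples, threshold=40.0):
    """Fire only if every sample exceeds the threshold.

    Any comparison with NaN is False, so a single NaN blocks the alert.
    """
    return all(s > threshold for s in samples)


def nan_aware_alert_fires(samples, threshold=40.0):
    """Fire if every non-NaN sample exceeds the threshold.

    Windows without any valid sample do not fire.
    """
    valid = [s for s in samples if not math.isnan(s)]
    if not valid:
        return False
    return all(s > threshold for s in valid)


# Assumed load15 samples in the 50-70 range with one gap in the data.
window = [55.0, float("nan"), 62.0, 70.0]

print(naive_alert_fires(window))
print(nan_aware_alert_fires(window))
```

With the assumed window the naive check stays silent although all real measurements are far above 40, while the NaN-aware variant fires.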
Updated by okurz 9 months ago
- Status changed from Resolved to New
- Assignee deleted (okurz)
Still no effective alert notifications, e.g. from https://monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1&viewPanel=54694&from=1712517495427&to=1712537836946&editPanel=54694&tab=alert
Updated by okurz 9 months ago
- Due date set to 2024-04-22
- Status changed from In Progress to Feedback
- Priority changed from High to Low
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1142 merged. Monitoring for alerts.
Updated by okurz 9 months ago
alerts were triggered so it works. Bumped threshold with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1143
Updated by okurz 8 months ago
- Due date deleted (2024-04-22)
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1143 merged. An alert was received and it is valid; mkittler will follow up on it.