[osd][alert] CPU usage alert: IOwait too high
Message from grafana alerts:
[osd-admins] [Alerting] CPU usage alert
From: Grafana <email@example.com>
To: firstname.lastname@example.org
Sender: osd-admins <email@example.com>
List-Id: <osd-admins.suse.de>
Date: 06/08/2020 17.18

[Alerting] CPU usage alert
IOwait too high
Metric name: iowait
Value: 48.749
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1596724843107&to=1596727490292 shows one of the problematic periods when this happened. We already tried to dumb down the alert so that it triggers less often, but we often seem to end up back in a situation of high IO wait. Could this be related to the recent openQA changes that load more test modules in parallel?
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1596718082634&to=1596728132456 shows that many new jobs were scheduled at around 15:10, but it took until around 17:05 for high CPU usage, especially IO load, to coincide with a sudden increase in minion jobs, an increased number of apache workers (?), spotty HTTP responses, and higher disk I/O for vdd.
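To cross-check the iowait value from the alert directly on the host, the share of CPU time spent in iowait can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The following is only a minimal sketch under that assumption; it is not how the Grafana panel computes its metric, and the function names `parse_cpu_line` and `iowait_percent` are hypothetical helpers introduced for illustration.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return the iowait share (in percent) between two counter samples.

    Field order in /proc/stat:
    user nice system idle iowait irq softirq steal guest guest_nice
    """
    deltas = [b - a for a, b in zip(after, before)]
    deltas = [-d for d in deltas]  # after - before for each counter
    # guest and guest_nice are already accounted for in user and nice,
    # so only the first eight counters are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


if __name__ == "__main__":
    # The 5 second sampling interval is an arbitrary choice for this sketch.
    first = read_cpu_counters()
    time.sleep(5)
    second = read_cpu_counters()
    print(f"iowait: {iowait_percent(first, second):.3f}%")
```

Running this on the web UI host during one of the problematic periods should give a number that is roughly comparable to the alert value, keeping in mind that the sampling window differs from the one used by the dashboard.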
- Due date set to 2020-09-11
- Status changed from In Progress to Feedback
Paused the alert "CPU usage alert" until the MR is effective. I want to wait for feedback and merge in 1-2 days at the latest.
- Status changed from Feedback to Resolved
Got delayed by the GitLab CI pipeline being blocked due to package problems in Factory+Tumbleweed, see #71182. Now it's merged. Checked on monitor.qa.suse.de in /var/lib/grafana/dashboards that the dashboard config is current. Re-enabled the alert for https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=23&fullscreen&edit&tab=alert&refresh=30s and it's back to green status.