action #69664
[osd][alert] CPU usage alert: IOwait too high
0%
Description
Observation
Message from grafana alerts:
[osd-admins] [Alerting] CPU usage alert
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id: <osd-admins.suse.de>
Date: 06/08/2020 17.18

[Alerting] CPU usage alert
IOwait too high
Metric name: iowait
Value: 48.749
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1596724843107&to=1596727490292 shows one of the problematic times when this happened. We already tried to dumb down the alert so that it triggers less often, but it seems we are often back in the situation of high IO wait. Could this be related to recent changes in openQA to load more test modules in parallel?
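To cross-check a reported iowait value directly on the host, the share of time spent in iowait can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The following is only a minimal sketch under that assumption; how the Grafana panel actually computes its metric is not stated in this ticket, and the function names `parse_cpu_line` and `iowait_percent` are made up for illustration.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return the iowait share (in percent) between two 'cpu' line samples.

    Field order in /proc/stat: user nice system idle iowait irq softirq steal
    (guest and guest_nice follow, but are already included in user and nice,
    so only the first eight counters are summed).
    """
    first = parse_cpu_line(before)[:8]
    second = parse_cpu_line(after)[:8]
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        # No ticks elapsed between the samples; nothing to report.
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter
```

On a Linux host the two samples would be the first line of `/proc/stat`, read a few seconds apart; a result near the value from the alert mail would suggest the alert reflects real I/O wait rather than a dashboard artifact.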
History
#1
Updated by okurz 6 months ago
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1596718082634&to=1596728132456 shows that many new jobs were scheduled at around 1510, but it took until around 1705 for high CPU usage, especially I/O load, to appear. It coincides with a sudden increase of minion jobs, an increased number of apache workers (?), spotty HTTP responses and higher disk I/O for vdd.
#3
Updated by okurz 4 months ago
- Due date set to 2020-09-11
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/352
Paused the alert "CPU usage alert" until the MR is effective. I want to wait for feedback and merge in 1-2 days at the latest.
#4
Updated by okurz 4 months ago
- Status changed from Feedback to Resolved
Got delayed by the GitLab CI pipeline being blocked due to package problems in Factory+Tumbleweed, see #71182. Now it's merged. Checked on monitor.qa.suse.de in /var/lib/grafana/dashboards that the dashboard config is current. Re-enabled the alert for https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=23&fullscreen&edit&tab=alert&refresh=30s and it is back to green status.