action #69664
[osd][alert] CPU usage alert: IOwait too high (closed)
Description
Observation
Message from grafana alerts:
[osd-admins] [Alerting] CPU usage alert

From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id: <osd-admins.suse.de>
Date: 06/08/2020 17.18

[Alerting] CPU usage alert
IOwait too high
Metric name: iowait
Value: 48.749
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1596724843107&to=1596727490292 shows one of the problematic periods when this happened. We already tried to dumb down the alert so that it triggers less often, but we seem to end up in the high IO wait situation regularly. Could this be related to the recent openQA changes that load more test modules in parallel?
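For reference, the iowait value in the alert is a percentage of CPU time. The ticket does not state how the monitoring stack derives it, so the following is only a minimal sketch, assuming the usual approach of comparing two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The function names `read_cpu_counters` and `iowait_percent` are hypothetical and not part of openQA or Grafana.

```python
def read_cpu_counters(path="/proc/stat"):
    """Return the integer counters of the aggregate "cpu" line (Linux only)."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "cpu":
                return [int(x) for x in fields[1:]]
    raise RuntimeError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two counter samples.

    Counter order in /proc/stat: user, nice, system, idle, iowait,
    irq, softirq, steal, guest, guest_nice.
    """
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters are summed, because guest time is
    # already accounted for in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

Taking one sample with `read_cpu_counters()`, waiting a few seconds, taking a second sample, and passing both to `iowait_percent` gives a number comparable in spirit to the 48.749 reported above, although the alert itself may average over a different window.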
Updated by okurz over 4 years ago
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1596718082634&to=1596728132456 shows that many new jobs were scheduled at around 15:10, but it took until around 17:05 for high CPU usage, especially IO load, to coincide with a sudden increase of minion jobs, an increased number of apache workers (?), spotty HTTP responses, and higher disk I/O for vdd.
Updated by okurz over 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz over 4 years ago
- Due date set to 2020-09-11
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/352
Paused the alert "CPU usage alert" until the MR is effective. I want to wait for feedback and merge in 1-2 days at the latest.
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
Got delayed by the GitLab CI pipeline being blocked due to package problems in Factory+Tumbleweed, see #71182. Now it's merged. Checked on monitor.qa.suse.de in /var/lib/grafana/dashboards that the dashboard config is current. Re-enabled the alert for https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=23&fullscreen&edit&tab=alert&refresh=30s and it's back to green status.