action #69664

[osd][alert] CPU usage alert: IOwait too high

Added by okurz 6 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-08-06
Due date: 2020-09-11
% Done: 0%
Estimated time:
Tags:

Description

Observation

Message from grafana alerts:

[osd-admins] [Alerting] CPU usage alert

From:   Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id:    <osd-admins.suse.de>
Date:   06/08/2020 17.18
 
[Alerting] CPU usage alert
IOwait too high

Metric name    Value
iowait         48.749

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1596724843107&to=1596727490292 shows one of the problematic periods when this happened. We already tried to tone down the alert so that it triggers less often, but it seems we often end up with high IO wait again. Could this be related to recent openQA changes that load more test modules in parallel?
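For reference when cross-checking the alert value on the host itself, here is a minimal sketch of how an iowait percentage can be derived on Linux from two samples of the aggregate `cpu` line in /proc/stat. This is an illustration under assumptions: the helper names `parse_cpu_line`, `iowait_percent` and `read_cpu_line` are hypothetical, and the Grafana panel may compute its `iowait` metric differently.

```python
def parse_cpu_line(line):
    """Return the numeric counters of an aggregate 'cpu' line from /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent.

    Field order in /proc/stat: user nice system idle iowait irq softirq steal
    guest guest_nice. Only the first eight are summed because guest time is
    already included in user time.
    """
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line; only meaningful on a Linux host."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `iowait_percent` gives a value that can be compared with the one reported in the alert mail.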

History

#1 Updated by okurz 6 months ago

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=1596718082634&to=1596728132456 shows that many new jobs were scheduled at around 15:10, but high CPU usage, especially IO load, only appeared at around 17:05. It coincided with a sudden increase in minion jobs, an increased number of apache workers (?), spotty HTTP responses and higher disk I/O for vdd.
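The `from` and `to` parameters in these Grafana links are Unix timestamps in milliseconds. A small sketch for turning them into readable UTC times when comparing a dashboard link against logs; the helper name `grafana_time_range` is hypothetical and only standard library calls are used:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_time_range(url):
    """Return (start, end) of a Grafana link as timezone-aware UTC datetimes.

    Assumes absolute 'from' and 'to' query parameters in epoch milliseconds;
    relative ranges such as 'now-6h' are not handled.
    """
    query = parse_qs(urlparse(url).query)
    start = EPOCH + timedelta(milliseconds=int(query["from"][0]))
    end = EPOCH + timedelta(milliseconds=int(query["to"][0]))
    return start, end


url = ("https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary"
       "?from=1596718082634&to=1596728132456")
start, end = grafana_time_range(url)
print(start.isoformat(), end.isoformat())
```

For the link above this gives a window from 12:48:02 UTC to 15:35:32 UTC on 2020-08-06, which matches the times mentioned in this comment if those are read as local time at UTC+2.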

#2 Updated by okurz 4 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

#3 Updated by okurz 4 months ago

  • Due date set to 2020-09-11
  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/352

Paused the alert "CPU usage alert" until the MR is effective. I want to wait for feedback and merge in 1-2 days at the latest.

#4 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

This got delayed by the GitLab CI pipeline being blocked due to package problems in Factory+Tumbleweed, see #71182. Now it's merged. Checked on monitor.qa.suse.de in /var/lib/grafana/dashboards that the dashboard config is current. Re-enabled the alert for https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=23&fullscreen&edit&tab=alert&refresh=30s and it's back to green status.
