action #59621

osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks

Added by okurz over 1 year ago. Updated about 1 month ago.

Status: New
Priority: Low
Assignee: -
Target version:
Start date: 2019-11-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows alerting CPU usage and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows "Disk I/O time for /dev/vdd" alerting.

from chat:

All storage comes from NetApp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO; that's why we got the alerts. Apache was writing at roughly 100MB/s, which is not that fast… The highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the NetApp. If this is really our bottleneck we might need to invest in separate hardware (not strictly speaking about a separate server for OSD).
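On the question of what I/O time tells us: assuming the Grafana panel is fed from the kernel's per-device busy-time counter (the tenth statistics field of a /proc/diskstats line, milliseconds during which the device had I/O in flight), the value is roughly the share of wall-clock time the disk was busy. The sketch below is not taken from the OSD monitoring setup. The function names and the two sample lines are made up for illustration; only the /proc/diskstats field layout is from the Linux kernel documentation.

```python
def parse_io_ticks(diskstats_text, device):
    """Return the busy-time counter (ms spent doing I/O) for `device`.

    A /proc/diskstats line is: major, minor, device name, then the
    statistics fields. The tenth statistics field (index 12 after
    splitting on whitespace) is the time spent doing I/Os in ms.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(f"device {device!r} not found")


def busy_fraction(ticks_before, ticks_after, interval_ms):
    """Share of the sampling interval during which the device was busy."""
    return (ticks_after - ticks_before) / interval_ms


# Hypothetical samples for vdd, taken 10 seconds apart (invented numbers).
SAMPLE_BEFORE = "254 48 vdd 1000 0 80000 5000 2000 0 160000 9000 0 12000 14000"
SAMPLE_AFTER = "254 48 vdd 1500 0 120000 9000 4000 0 320000 30000 3 21000 39000"

if __name__ == "__main__":
    before = parse_io_ticks(SAMPLE_BEFORE, "vdd")
    after = parse_io_ticks(SAMPLE_AFTER, "vdd")
    print(f"vdd busy: {busy_fraction(before, after, 10_000):.0%}")
```

A value near 100% only says the device always had at least one request outstanding. On shared network storage that can happen well below the throughput limit, which would fit the "slow IO drives CPU up" reading above.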

History

#1 Updated by okurz 9 months ago

  • Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
  • Priority changed from Normal to Low

#2 Updated by okurz 9 months ago

  • Target version set to Ready

#3 Updated by okurz 7 months ago

  • Target version changed from Ready to future

So far this does not seem to impact our daily operations, apart from the occasional failing I/O time alert. Consider bumping the I/O time alert thresholds if this becomes a problem, but otherwise do not plan any actions.

#4 Updated by cdywan about 1 month ago

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 136.320

[OK] CPU Load alert

Saw these this morning, at 9.47Z and 10.06Z respectively, so it took 19 minutes until the CPU load went back to the expected state.

I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D
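For reference, the duration between an [Alerting] and an [OK] mail can be computed rather than counted by hand. This is a minimal sketch, assuming both timestamps are UTC times on the same day written in the `H.MMZ` style used in this comment; `minutes_between` is a made-up helper name.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes from `start` to `end`, both given like '9.47Z' (same day, UTC)."""
    fmt = "%H.%M"
    t0 = datetime.strptime(start.rstrip("Z"), fmt)
    t1 = datetime.strptime(end.rstrip("Z"), fmt)
    return int((t1 - t0).total_seconds() // 60)


if __name__ == "__main__":
    print(minutes_between("9.47Z", "10.06Z"))
```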

#5 Updated by cdywan about 1 month ago

From 3.27Z to 3.33Z:

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 82.761

[OK] CPU Load alert
