action #59621
osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
0%
Description
Observation¶
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows alerting CPU usage and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows "Disk I/O time for /dev/vdd" alerting.
from chat:
all storage comes from netapp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO. that's why we got the alerts. apache was roughly writing at ~100MB/s which is not that fast… . cthe highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the netapp. If this is really our bottleneck we might need to invest into separate hardware (not strictly speaking about a separate server for OSD).
Related issues
History
#1
Updated by okurz almost 3 years ago
- Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
- Priority changed from Normal to Low
#2
Updated by okurz almost 3 years ago
- Target version set to Ready
#3
Updated by okurz over 2 years ago
- Target version changed from Ready to future
so far this does not seem to impact our daily operations except for the occasional I/O time alert failing. Consider to bump the I/O time alert thresholds on problems but otherwise do not plan any actions.
#4
Updated by cdywan about 2 years ago
[Alerting] CPU Load alert Metric name 5 Minutes Average Value 136.320 [OK] CPU Load alert
Saw these this morning, 9.47Z and 10.06Z respectively, so that makes 19 minutes til the CPU went back to the expected state.
I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D
#5
Updated by cdywan about 2 years ago
From 3.27Z to 3.33Z:
[Alerting] CPU Load alert Metric name 5 Minutes Average Value 82.761 [OK] CPU Load alert
#6
Updated by okurz 12 months ago
- Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added