action #59621
closed
osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows alerting CPU usage and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows "Disk I/O time for /dev/vdd" alerting.
from chat:
all storage comes from netapp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO. that's why we got the alerts. apache was roughly writing at ~100MB/s which is not that fast… the highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the netapp. If this is really our bottleneck we might need to invest into separate hardware (not strictly speaking about a separate server for OSD).
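On the question of what "I/O time" tells us: on Linux, the per-device counter behind such metrics is usually the "milliseconds spent doing I/Os" field in `/proc/diskstats`, which grows whenever the device has at least one request in flight. Below is a minimal sketch of how a busy fraction could be derived from two samples of that file. It is an assumption that the Grafana panel is fed from this counter; the helper names and the sample values in the usage note are hypothetical.

```python
def io_time_ms(diskstats_text: str, device: str) -> int:
    """Return the 'time spent doing I/Os' counter (ms) for one device.

    Expects text in /proc/diskstats format: major, minor, device name,
    followed by the statistics fields. The tenth statistics field
    (index 12 after splitting on whitespace) is the I/O time in ms.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(f"device {device!r} not found")


def busy_fraction(before: str, after: str, device: str, interval_s: float) -> float:
    """Fraction of the sampling interval during which the device had I/O in flight."""
    delta_ms = io_time_ms(after, device) - io_time_ms(before, device)
    return delta_ms / (interval_s * 1000.0)
```

For example, if the counter for `vdd` grows by 5000 ms between two samples taken 10 s apart, `busy_fraction` returns 0.5, meaning the device was busy for half of the interval. A value close to 1.0 means the device was almost never idle, which on shared storage may point at the backend rather than at the VM itself.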
Updated by okurz over 4 years ago
- Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
- Priority changed from Normal to Low
Updated by okurz over 4 years ago
- Target version changed from Ready to future
so far this does not seem to impact our daily operations except for the occasional I/O time alert failing. Consider bumping the I/O time alert thresholds if problems occur, but otherwise do not plan any actions.
Updated by livdywan almost 4 years ago
[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 136.320
[OK] CPU Load alert
Saw these this morning, at 09:47Z and 10:06Z respectively, so it took 19 minutes until the CPU went back to the expected state.
I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D
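For comparing alert episodes, the duration between the [Alerting] and [OK] notifications can be computed mechanically. This is only a small sketch; the function name is hypothetical and it assumes both timestamps fall on the same UTC day.

```python
from datetime import datetime


def alert_duration_minutes(alerting_at: str, ok_at: str) -> int:
    """Minutes between two same-day 'HH:MM' timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(ok_at, fmt) - datetime.strptime(alerting_at, fmt)
    return int(delta.total_seconds() // 60)
```

Called with `"09:47"` and `"10:06"` this gives 19, matching the figure above.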
Updated by livdywan almost 4 years ago
From 03:27Z to 03:33Z:
[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 82.761
[OK] CPU Load alert
Updated by okurz over 2 years ago
- Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added
Updated by nicksinger 4 months ago
- Status changed from New to Resolved
- Assignee set to nicksinger
- Target version changed from future to Ready
okurz wrote in #note-3:
so far this does not seem to impact our daily operations except for the occasional I/O time alert failing. Consider bumping the I/O time alert thresholds if problems occur, but otherwise do not plan any actions.
The state history on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=now-90d&to=now&orgId=1&editPanel=158&tab=alert, looking especially for vdd alert instances, shows that we have not seen any alert since at least 2023-08-08, and the ones back then showed insanely high times (10079966s and 364858s), which leads me to believe that this was an exception. Current alert times are at 10s, which we decided in the past to be reasonably large, so no further action is needed.