action #59621

open

osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks

Added by okurz over 4 years ago. Updated about 3 years ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version:
Start date: 2019-11-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows the "CPU usage" alert triggering and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows the "Disk I/O time for /dev/vdd" alert triggering.

from chat:

All storage comes from NetApp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO; that's why we got the alerts. Apache was writing at roughly 100 MB/s, which is not that fast. The highest I saw in htop was 10 MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the NetApp. If this really is our bottleneck, we might need to invest in separate hardware (not strictly speaking about a separate server for OSD).
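On the question of what I/O time actually tells us: on Linux, a "disk I/O time" metric is commonly derived from the cumulative "time spent doing I/Os" counter (in milliseconds) in /proc/diskstats, i.e. the share of wall-clock time during which the device had at least one request in flight. Whether the Grafana panel here is built on exactly this counter is an assumption, and the function names below are hypothetical. A minimal sketch of the calculation:

```python
import time


def parse_io_ticks(diskstats_text, device):
    """Return the cumulative 'time spent doing I/Os' (ms) for one device.

    In /proc/diskstats each line is: major, minor, device name, followed by
    the statistics fields; the 10th statistics field (index 12 after
    splitting on whitespace) is the time spent doing I/Os in milliseconds.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(f"device {device!r} not found")


def io_time_percent(ticks_before_ms, ticks_after_ms, interval_s):
    """Share of the sampling interval during which the device was busy."""
    return 100.0 * (ticks_after_ms - ticks_before_ms) / (interval_s * 1000.0)


def sample_device(device="vdd", interval_s=5.0, path="/proc/diskstats"):
    """Take two samples from the running system and return the busy share."""
    with open(path) as f:
        before = parse_io_ticks(f.read(), device)
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read(), device)
    return io_time_percent(before, after, interval_s)


if __name__ == "__main__":
    # "vdd" is the device named in the alert; adjust for other disks.
    print(f"vdd busy: {sample_device('vdd'):.1f}%")
```

A value near 100% only says the device was never idle during the interval. On shared network storage it does not by itself show whether the VM or the path to the storage backend is the limiting factor, which matches the uncertainty expressed in the chat.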


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M (Resolved, kraih)

Actions #1

Updated by okurz almost 4 years ago

  • Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
  • Priority changed from Normal to Low
Actions #2

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #3

Updated by okurz over 3 years ago

  • Target version changed from Ready to future

So far this does not seem to impact our daily operations, except for the occasional I/O time alert triggering. Consider bumping the I/O time alert thresholds if this causes problems, but otherwise do not plan any actions.

Actions #4

Updated by livdywan about 3 years ago

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 136.320

[OK] CPU Load alert

Saw these this morning, at 9.47Z and 10.06Z respectively, so it took 19 minutes until the CPU load went back to the expected state.

I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D

Actions #5

Updated by livdywan about 3 years ago

From 3.27Z to 3.33Z:

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 82.761

[OK] CPU Load alert
Actions #6

Updated by okurz almost 2 years ago

  • Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added