action #59621

closed

osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks

Added by okurz about 5 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2019-11-14
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows alerting CPU usage and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows "Disk I/O time for /dev/vdd" alerting.

from chat:

all storage comes from netapp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO, that's why we got the alerts. apache was roughly writing at ~100MB/s, which is not that fast… the highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the netapp. If this is really our bottleneck we might need to invest in separate hardware (not strictly speaking about a separate server for OSD).
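On the question of what "I/O time" tells us: a minimal sketch, assuming the Grafana panel is derived from the kernel's per-device "time spent doing I/Os" counter (the tenth statistic after the device name in /proc/diskstats, in milliseconds). Whether the panel really uses that counter is an assumption, and the function names and sample lines below are made up for illustration.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the 'time spent doing I/Os' counter (ms) for one device.

    Expects text in /proc/diskstats layout: major, minor, name,
    followed by the per-device statistics.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name; fields[12] is the 10th statistic.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def busy_fraction(before, after, device, interval_s):
    """Fraction of the interval during which the device had I/O in flight."""
    delta_ms = io_ticks_ms(after, device) - io_ticks_ms(before, device)
    return delta_ms / (interval_s * 1000.0)


# Hypothetical samples taken 10 seconds apart (not real OSD data).
SAMPLE_T0 = "254 48 vdd 1000 0 80000 5000 2000 0 160000 9000 0 12000 14000"
SAMPLE_T1 = "254 48 vdd 1500 0 120000 9000 2600 0 208000 15000 3 21000 24000"

print(busy_fraction(SAMPLE_T0, SAMPLE_T1, "vdd", 10))
```

A high value only says that requests were outstanding for most of the interval. On shared network storage this can reflect latency on the path to the backend rather than a saturated disk, which would fit the observation above that CPU load rises as a consequence of slow IO.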


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M (Resolved, kraih)

Actions #1

Updated by okurz over 4 years ago

  • Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
  • Priority changed from Normal to Low
Actions #2

Updated by okurz over 4 years ago

  • Target version set to Ready
Actions #3

Updated by okurz about 4 years ago

  • Target version changed from Ready to future

So far this does not seem to impact our daily operations except for the occasional I/O time alert. Consider bumping the I/O time alert thresholds if problems occur, but otherwise do not plan any actions.

Actions #4

Updated by livdywan over 3 years ago

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 136.320

[OK] CPU Load alert

Saw these this morning, at 09:47Z and 10:06Z respectively, so it took 19 minutes until the CPU load went back to the expected state.

I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D

Actions #5

Updated by livdywan over 3 years ago

From 03:27Z to 03:33Z:

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 82.761

[OK] CPU Load alert
Actions #6

Updated by okurz over 2 years ago

  • Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added
Actions #7

Updated by nicksinger 3 months ago

  • Status changed from New to Resolved
  • Assignee set to nicksinger
  • Target version changed from future to Ready

okurz wrote in #note-3:

So far this does not seem to impact our daily operations except for the occasional I/O time alert. Consider bumping the I/O time alert thresholds if problems occur, but otherwise do not plan any actions.

The state history on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=now-90d&to=now&orgId=1&editPanel=158&tab=alert, especially looking for vdd alert instances, shows that we have not seen any alert since at least 2023-08-08 with insanely high times (10079966s and 364858s), which leads me to believe that this was an exception. Current alert times are at 10s, which we decided in the past to be reasonably large, so no further action is needed.
