action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks - openQA Infrastructure (public) - openSUSE Project Management Tool

action #59621

closed

osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks

Added by okurz about 5 years ago. Updated 4 months ago.

Status:

Resolved

Priority:

Low

Assignee:

nicksinger

Category:

Target version:

openQA Project (public) - Ready

Start date:

2019-11-14

Due date:

% Done:

Estimated time:

Tags:

alert

Description

Observation¶

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=23&orgId=1&from=1573686000000&to=1573732800000
shows alerting CPU usage and https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&panelId=48&from=1573722000000&to=1573732800000 shows "Disk I/O time for /dev/vdd" alerting.

from chat:

All storage comes from NetApp. Not sure what I/O time actually tells us. IO going up and CPU going up may just mean: we're screwed. "CPU going up" is basically a consequence of the slow IO; that's why we got the alerts. Apache was writing at roughly 100MB/s, which is not that fast. The highest I saw in htop was 10MB/s per httpd_prefork process. I wonder if infra monitors their virtualization host. I guess all VMs share the same path to the NetApp. If this is really our bottleneck we might need to invest in separate hardware (not strictly speaking about a separate server for OSD).
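
On the question of what "I/O time" tells us: a minimal sketch, assuming the Grafana panel is derived from the kernel's "time spent doing I/Os" counter in /proc/diskstats (this ticket does not confirm the data source). The counter is cumulative milliseconds during which the device had at least one request in flight, so the difference between two samples divided by the sampling interval gives the fraction of time the device was busy. The function names below are hypothetical and only serve the illustration.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O.

    Expects the /proc/diskstats layout: major, minor, name, then the
    per-device counters; the 10th counter (index 12 after splitting)
    is "time spent doing I/Os (ms)".
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        ticks[fields[2]] = int(fields[12])
    return ticks


def busy_fraction(before_ms, after_ms, interval_s):
    """Fraction of the interval during which the device was busy."""
    return (after_ms - before_ms) / (interval_s * 1000.0)


def sample_busy_fraction(device="vdd", interval_s=1.0, path="/proc/diskstats"):
    """Take two samples of the counter and return the busy fraction."""
    with open(path) as f:
        before = parse_io_ticks(f.read())[device]
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())[device]
    return busy_fraction(before, after, interval_s)
```

A value close to 1.0 only says the device always had something queued. On shared network storage that can point to slow backend response just as well as to heavy local load, which fits the suspicion above that the path to the storage is the bottleneck.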

Related issues 1 (0 open — 1 closed)

Updated by okurz over 4 years ago

Subject changed from osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage" to osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks
Priority changed from Normal to Low

Updated by okurz over 4 years ago

Target version set to Ready

Updated by okurz over 4 years ago

Target version changed from Ready to future

So far this does not seem to impact our daily operations except for the occasional I/O time alert triggering. Consider bumping the I/O time alert thresholds if this causes problems, but otherwise do not plan any actions.

Updated by livdywan almost 4 years ago

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 136.320

[OK] CPU Load alert

Saw these this morning at 09:47Z and 10:06Z respectively, so it took 19 minutes until the CPU load went back to the expected state.

I assume this is the correct ticket for this? Although it's suspiciously old, so please feel free to correct me here or let me know if we should have a new ticket :-D

Updated by livdywan almost 4 years ago

From 03:27Z to 03:33Z:

[Alerting] CPU Load alert
Metric name: 5 Minutes Average
Value: 82.761

[OK] CPU Load alert

Updated by okurz over 2 years ago

Related to action #110269: [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M added

Updated by nicksinger 4 months ago

Status changed from New to Resolved
Assignee set to nicksinger
Target version changed from future to Ready

okurz wrote in #note-3:

So far this does not seem to impact our daily operations except for the occasional I/O time alert triggering. Consider bumping the I/O time alert thresholds if this causes problems, but otherwise do not plan any actions.

The state history on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?from=now-90d&to=now&orgId=1&editPanel=158&tab=alert (looking especially for vdd alert instances) shows that we have not seen any alert since at least 2023-08-08, when the times were insanely high (10079966s and 364858s), which leads me to believe that this was an exception. Current alert times are at 10s, which we decided in the past to be reasonably large, so no further action is needed.
