action #110269

[alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M

Added by tinita 2 months ago. Updated 26 days ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

  [Alerting] QA-Power8-4-kvm: Disk I/O time alert

      Metric name   Value
      sdf           11500.000
      sdi           11500.000

View your Alert rule [stats.openqa-monitor.qa.suse.de]
Go to the Alerts page [stats.openqa-monitor.qa.suse.de]

https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?tab=alert&viewPanel=56720&orgId=1&refresh=1m

I paused the alert for now.

Acceptance criteria

  • AC1: No more similar alerts
  • AC2: Relevancy of I/O alerts is understood

Suggestions

  • No apparent problems due to this (bump values?)
  • Research and consider what makes sense
  • Check the disk health
  • Consider performance
  • Unpause the alert but change the notification target for evaluation

Rollback steps

  • Unpause alert on QA-Power8-4-kvm: Disk I/O time alert
  • Unpause alert on QA-Power8-5-kvm: Disk I/O time alert

Related issues

Related to openQA Infrastructure - action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M (Resolved, 2021-07-29)

Related to openQA Infrastructure - action #70834: [alert] Refine I/O time alerts for OSD (Resolved, 2020-09-02)

Related to openQA Infrastructure - action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks (New, 2019-11-14)

Related to openQA Infrastructure - action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M (Resolved, 2022-06-08)

History

#1 Updated by tinita 2 months ago

  • Description updated (diff)

#2 Updated by cdywan about 2 months ago

  • Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert to [alert] QA-Power8-4-kvm: Disk I/O time alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Urgent to Normal

#3 Updated by okurz about 2 months ago

  • Tags set to reactive work

#4 Updated by okurz about 1 month ago

  • Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert size:M to [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
  • Description updated (diff)
  • Priority changed from Normal to High

#5 Updated by kraih 27 days ago

  • Assignee set to kraih

#6 Updated by kraih 27 days ago

  • Status changed from Workable to In Progress

#7 Updated by okurz 27 days ago

  • Related to action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M added

#8 Updated by okurz 27 days ago

  • Related to action #70834: [alert] Refine I/O time alerts for OSD added

#9 Updated by okurz 27 days ago

  • Related to action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks added

#10 Updated by kraih 26 days ago

There's a redundant 1500ms alert condition that can be removed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695

#11 Updated by kraih 26 days ago

  • Status changed from In Progress to Feedback

We've talked about this on Slack a bit, and since this alert currently applies to both SSDs and HDDs, the higher 20000 threshold makes the most sense for us right now. That should be above the values that triggered the recent alerts, making this a non-issue. If it comes up again, we will have to re-evaluate.
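
To illustrate the reasoning above, here is a minimal sketch that compares the value reported in the alert (11500.000 for sdf and sdi) against the two thresholds mentioned in this ticket. The helper name `io_time_exceeds` is hypothetical, and the sketch assumes both thresholds are expressed in the same unit as the reported metric; it does not reflect how the alert rule itself is evaluated.

```shell
# Hypothetical helper: succeeds (exit status 0) when value > threshold.
# awk is used so that decimal values such as 11500.000 compare numerically.
io_time_exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) > (t + 0)) }'
}

# Assumption: 1500 and 20000 use the same unit as the reported value.
for threshold in 1500 20000; do
    if io_time_exceeds 11500.000 "$threshold"; then
        echo "threshold $threshold: would alert"
    else
        echo "threshold $threshold: ok"
    fi
done
```

Under these assumptions the observed value is above the removed 1500 condition but below the remaining 20000 threshold.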

#12 Updated by kraih 26 days ago

Regarding disk health, everything seems ok:

QA-Power8-4-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-4-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
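
The manual checks above could be repeated with a small loop. This is only a sketch: `smart_passed` and `check_disks` are hypothetical helper names, and the match relies on the exact result line quoted in this ticket, so drives that report their health in a different format would be counted as not passed.

```shell
# Hypothetical helper: reads `smartctl -H` output on stdin and succeeds only
# if it contains the PASSED line in the format quoted in this ticket.
smart_passed() {
    grep -q 'SMART overall-health self-assessment test result: PASSED'
}

# Hypothetical wrapper: checks each device given as an argument and
# returns non-zero if any of them does not report PASSED.
check_disks() {
    rc=0
    for dev in "$@"; do
        if sudo smartctl -H "$dev" | smart_passed; then
            echo "$dev: PASSED"
        else
            echo "$dev: NOT PASSED" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

On each host this would be called as `check_disks /dev/sda /dev/sdb`.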

#13 Updated by kraih 26 days ago

Resumed alerts for now.

#14 Updated by okurz 26 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695 is merged and live, but I found that the alert is still there for openQA workers. Then I realized that you changed the template for "generic" machines; we use a different file for openQA workers. See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 for a follow-up.

#16 Updated by kraih 26 days ago

  • Status changed from Feedback to Resolved

#17 Updated by okurz 19 days ago

  • Related to action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M added
