action #110269

closed

[alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M

Added by tinita almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

  [Alerting] QA-Power8-4-kvm: Disk I/O time alert

      Metric name   Value
      sdf           11500.000
      sdi           11500.000

View your Alert rule [stats.openqa-monitor.qa.suse.de]
Go to the Alerts page [stats.openqa-monitor.qa.suse.de]

https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?tab=alert&viewPanel=56720&orgId=1&refresh=1m

I paused the alert for now.

Acceptance criteria

  • AC1: No more similar alerts
  • AC2: Relevancy of I/O alerts is understood

Suggestions

  • No apparent problems caused by this so far (raise the thresholds?)
  • Research and consider what makes sense
  • Check the disk health
  • Consider performance
  • Unpause the alert but change the notification target for evaluation

Rollback steps

  • Unpause alert on QA-Power8-4-kvm: Disk I/O time alert
  • Unpause alert on QA-Power8-5-kvm: Disk I/O time alert

Related issues 4 (1 open, 3 closed)

  • Related to openQA Infrastructure - action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M (Resolved, mkittler, 2021-07-29)
  • Related to openQA Infrastructure - action #70834: [alert] Refine I/O time alerts for OSD (Resolved, okurz, 2020-09-02)
  • Related to openQA Infrastructure - action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks (New, 2019-11-14)
  • Related to openQA Infrastructure - action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M (Resolved, okurz, 2022-06-08)

Actions #1

Updated by tinita almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by livdywan almost 2 years ago

  • Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert to [alert] QA-Power8-4-kvm: Disk I/O time alert size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Urgent to Normal
Actions #3

Updated by okurz almost 2 years ago

  • Tags set to reactive work
Actions #4

Updated by okurz almost 2 years ago

  • Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert size:M to [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
  • Description updated (diff)
  • Priority changed from Normal to High
Actions #5

Updated by kraih almost 2 years ago

  • Assignee set to kraih
Actions #6

Updated by kraih almost 2 years ago

  • Status changed from Workable to In Progress
Actions #7

Updated by okurz almost 2 years ago

  • Related to action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M added
Actions #8

Updated by okurz almost 2 years ago

  • Related to action #70834: [alert] Refine I/O time alerts for OSD added
Actions #9

Updated by okurz almost 2 years ago

  • Related to action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks added
Actions #10

Updated by kraih almost 2 years ago

There's a redundant 1500ms alert condition that can be removed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695

Actions #11

Updated by kraih almost 2 years ago

  • Status changed from In Progress to Feedback

We discussed this briefly on Slack. Since this alert currently applies to both SSDs and HDDs, the higher threshold of 20000 makes the most sense for us right now. That threshold is above the values that triggered the recent alerts, so this should become a non-issue. If the alert comes up again, we will have to re-evaluate.
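To make the reasoning concrete, here is a minimal sketch that compares the values from the alert in the description against a threshold. The function name `metrics_above_threshold` and the "name value" input format are hypothetical, chosen only for illustration; the sketch also assumes the reported values and the thresholds use the same unit. It is not how the Grafana alert itself is evaluated.

```shell
#!/bin/sh
# Hypothetical helper: prints the name of every metric whose value is
# above the given threshold. Input is "name value" pairs on stdin,
# one pair per line.
metrics_above_threshold() {
    awk -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

# Values taken from the alert in the description, checked against the
# 20000 threshold: neither sdf nor sdi is above it, so nothing is printed.
printf 'sdf 11500.000\nsdi 11500.000\n' | metrics_above_threshold 20000
```

Running the same input against 1500 instead lists both sdf and sdi, which matches the observation that the lower condition was the one firing.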

Actions #12

Updated by kraih almost 2 years ago

Regarding disk health, everything seems ok:

QA-Power8-4-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-4-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
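
The manual checks above could be repeated with a small loop. This is only a sketch: the helper name `smart_health_passed` and the device list are assumptions, and it only recognizes the "self-assessment test result: PASSED" line shown in the output above, so drives that report health in a different format would need a different pattern.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -H` output on stdin and succeeds
# only if it contains the overall-health line reporting PASSED.
smart_health_passed() {
    grep -q 'self-assessment test result: PASSED'
}

# Possible usage on a worker (the device list is an assumption; adjust it
# to the disks actually present on the machine):
# for dev in /dev/sda /dev/sdb; do
#     if sudo smartctl -H "$dev" | smart_health_passed; then
#         echo "$dev: PASSED"
#     else
#         echo "$dev: needs a closer look"
#     fi
# done
```

The loop is left commented out because it needs root access and real devices. Note that a passing overall health result says little about I/O performance.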
Actions #13

Updated by kraih almost 2 years ago

Resumed alerts for now.

Actions #14

Updated by okurz almost 2 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695 is merged and live, but I found that the alert is still there for openQA workers. Then I realized that you changed the template for "generic" machines; we use a different file for openQA workers. See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 for a follow-up.

Actions #16

Updated by kraih almost 2 years ago

  • Status changed from Feedback to Resolved
Actions #17

Updated by okurz almost 2 years ago

  • Related to action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M added