action #110269
closed [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
0%
Description
Observation
[Alerting] QA-Power8-4-kvm: Disk I/O time alert
Metric name | Value
sdf | 11500.000
sdi | 11500.000
View your Alert rule [stats.openqa-monitor.qa.suse.de]
Go to the Alerts page [stats.openqa-monitor.qa.suse.de]
I paused the alert for now.
Acceptance criteria
- AC1: No more similar alerts
- AC2: Relevancy of I/O alerts is understood
Suggestions
- No apparent problems due to this (bump values?)
- Research and consider what makes sense
- Check the disk health
- Consider performance
- Unpause the alert but change the notification target for evaluation
Rollback steps
- Unpause alert on QA-Power8-4-kvm: Disk I/O time alert
- Unpause alert on QA-Power8-5-kvm: Disk I/O time alert
Updated by livdywan over 2 years ago
- Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert to [alert] QA-Power8-4-kvm: Disk I/O time alert size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Urgent to Normal
Updated by okurz over 2 years ago
- Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert size:M to [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
- Description updated (diff)
- Priority changed from Normal to High
Now additionally on QA-Power8-5-kvm: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?tab=alert&viewPanel=56720&orgId=1&from=1653182620821&to=1653297037558 . I extended the ticket and paused the alert accordingly.
Updated by okurz over 2 years ago
- Related to action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M added
Updated by okurz over 2 years ago
- Related to action #70834: [alert] Refine I/O time alerts for OSD added
Updated by okurz over 2 years ago
- Related to action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks added
Updated by kraih over 2 years ago
There's a redundant 1500ms alert condition that can be removed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695
Updated by kraih over 2 years ago
- Status changed from In Progress to Feedback
We've talked about this on Slack a bit, and since this alert currently applies to both SSD and HDD drives, the higher 20000 threshold makes the most sense for us right now. It should also be above the values that triggered the recent alerts, making this a non-issue. If it comes up again, we will have to re-evaluate.
Updated by kraih over 2 years ago
Regarding disk health, everything seems ok:
QA-Power8-4-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
QA-Power8-4-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
QA-Power8-5-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
QA-Power8-5-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
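If this check needs repeating across more disks or hosts, the verdict line can be extracted from the output. The following is a rough sketch that only parses text in the format shown above; the function name `smart_health` is made up for illustration, and other device types may print a differently worded health line that this pattern does not handle.

```python
import re

# Matches the summary line as it appears in the smartctl -H output above.
HEALTH_RE = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")


def smart_health(output):
    """Return the overall-health verdict (e.g. 'PASSED'), or None if the line is absent."""
    match = HEALTH_RE.search(output)
    return match.group(1) if match else None


# Sample copied from the output for /dev/sda on QA-Power8-4-kvm.
sample = """\
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

print(smart_health(sample))
```

A result of None is worth treating as "unknown" rather than "healthy", since it only means the expected line was not found.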
Updated by okurz over 2 years ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695 is merged and live, but I found that the alert is still there for openQA workers. Then I realized that you changed the template for "generic" machines. We use a different file for openQA workers. See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 for a follow-up.
Updated by okurz over 2 years ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 merged and deployed. https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&editPanel=56720&tab=query doesn't show the alert anymore. I would say we are done :)
Updated by kraih over 2 years ago
- Status changed from Feedback to Resolved
okurz wrote:
> https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 merged and deployed. https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&editPanel=56720&tab=query doesn't show the alert anymore. I would say we are done :)
Great!
Updated by okurz over 2 years ago
- Related to action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M added