action #110269
[alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
Description
Observation
[Alerting] QA-Power8-4-kvm: Disk I/O time alert
Metric name Value
sdf 11500.000
sdi 11500.000
View your Alert rule [stats.openqa-monitor.qa.suse.de]
Go to the Alerts page [stats.openqa-monitor.qa.suse.de]
I paused the alert for now.
Acceptance criteria
- AC1: No more similar alerts
- AC2: Relevancy of I/O alerts is understood
Suggestions
- No apparent problems due to this (bump values?)
- Research and consider what makes sense
- Check the disk health
- Consider performance
- Unpause the alert but change the notification target for evaluation
Rollback steps
- Unpause alert on QA-Power8-4-kvm: Disk I/O time alert
- Unpause alert on QA-Power8-5-kvm: Disk I/O time alert
Related issues
History
#2
Updated by cdywan about 2 months ago
- Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert to [alert] QA-Power8-4-kvm: Disk I/O time alert size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Urgent to Normal
#3
Updated by okurz about 2 months ago
- Tags set to reactive work
#4
Updated by okurz about 1 month ago
- Subject changed from [alert] QA-Power8-4-kvm: Disk I/O time alert size:M to [alert] QA-Power8-4-kvm + QA-Power8-5-kvm: Disk I/O time alert size:M
- Description updated (diff)
- Priority changed from Normal to High
Now additionally on QA-Power8-5-kvm: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?tab=alert&viewPanel=56720&orgId=1&from=1653182620821&to=1653297037558. Extended the ticket and paused the alert accordingly.
#7
Updated by okurz 27 days ago
- Related to action #96242: [alert] Disk I/O time for /dev/vde (/space-slow) alert 2021-07-28 size:M added
#8
Updated by okurz 27 days ago
- Related to action #70834: [alert] Refine I/O time alerts for OSD added
#9
Updated by okurz 27 days ago
- Related to action #59621: osd: Sporadically high CPU and IO load (vdd), grafana alerts "Disk I/O time for /dev/vdd" and "CPU usage", also other disks added
#10
Updated by kraih 26 days ago
There's a redundant 1500ms alert condition that can be removed: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695
#11
Updated by kraih 26 days ago
- Status changed from In Progress to Feedback
We've talked about this on Slack a bit. Since this alert currently applies to both SSD and HDD drives, the higher 20000 threshold makes the most sense for us right now. It should also be above the values that triggered the recent alerts, making this a non-issue. If it comes up again, we will have to re-evaluate.
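To illustrate the reasoning, here is a minimal sketch that compares the value from the alert in this ticket (11500.000) against the two thresholds mentioned here (1500 and 20000). The helper name `exceeds_threshold` is made up for this example, and it assumes the alert fires when the value is strictly above the threshold and that all numbers use the same unit; the real rule lives in the Grafana templates in salt-states-openqa and may be defined differently.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when value is strictly above threshold.
# awk does the comparison because the alert values are decimals ("11500.000").
exceeds_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Value reported for sdf and sdi in the alert quoted in this ticket.
# Assumption: same unit as the thresholds below.
observed=11500.000

for threshold in 1500 20000; do
    if exceeds_threshold "$observed" "$threshold"; then
        echo "threshold $threshold: would alert at $observed"
    else
        echo "threshold $threshold: no alert at $observed"
    fi
done
```

Under these assumptions the observed value is above 1500 but below 20000, which matches the expectation that the higher threshold would not have fired for the recent alerts.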
#12
Updated by kraih 26 days ago
Regarding disk health, everything seems ok:
QA-Power8-4-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-4-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sda
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

QA-Power8-5-kvm:~> sudo smartctl -H /dev/sdb
smartctl 7.2 2021-09-14 r5237 [ppc64le-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
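For repeating this check across several disks, something like the following sketch might help. The function names `smart_passed` and `check_disks` are made up for this example, and the match string is simply the result line from the output above; other drive types may word the health result differently, so treat a "not PASSED" line as a prompt to look at the full `smartctl` output rather than as a verdict.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -H` output on stdin and succeeds only
# if it contains the overall-health PASSED line quoted in this ticket.
smart_passed() {
    grep -q 'SMART overall-health self-assessment test result: PASSED'
}

# Hypothetical wrapper: runs `smartctl -H` on each device given as an argument.
# Returns non-zero if any device does not report PASSED.
check_disks() {
    rc=0
    for dev in "$@"; do
        if sudo smartctl -H "$dev" | smart_passed; then
            echo "$dev: PASSED"
        else
            echo "$dev: not PASSED or not readable"
            rc=1
        fi
    done
    return "$rc"
}
```

On a worker this could be called as `check_disks /dev/sda /dev/sdb`; the device list is an assumption and would need to match the disks actually present on the host.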
#14
Updated by okurz 26 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/695 is merged and live, but I found that the alert is still there for openQA workers. Then I realized that you changed the template for "generic" machines. We use a different file for openQA workers. See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 for a follow-up.
#15
Updated by okurz 26 days ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 merged and deployed. https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&editPanel=56720&tab=query doesn't show the alert anymore. I would say we are done :)
#16
Updated by kraih 26 days ago
- Status changed from Feedback to Resolved
okurz wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/696 merged and deployed. https://monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&editPanel=56720&tab=query doesn't show the alert anymore. I would say we are done :)
Great!
#17
Updated by okurz 19 days ago
- Related to action #112196: [alert][sporadic] QA-Power8-4-kvm: Disk I/O time alert size:M added