action #174679

Updated by robert.richardson 5 months ago

[alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S 

 ## Observation 
 https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-12-23T04:59:10.961Z&to=2024-12-23T05:04:37.311Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720 
 and other instances show significantly slow responses to I/O requests, in the range of 10s; see 

 ![Screenshot_20241223_094515_baremetal_support_disk_io.png](Screenshot_20241223_094515_baremetal_support_disk_io.png) 

 ## Acceptance criteria 
 * **AC1**: It is known why I/O increased 
 * **AC2**: I/O does not continue to increase steadily 
 * **AC3**: There is no alert anymore about disk I/O on baremetal-support 

 ## Rollback actions 
 * Remove silence from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana `alertname=baremetal-support: Disk I/O time alert` 

 ## Suggestions 
 * Look into the concerning increase of disk I/O time in https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-11-14T12:20:35.343Z&to=2025-01-14T13:25:40.428Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720 
 * Check drive metrics for other VMs on qamaster 
 * Check the disk(s) for problems (on the hypervisor host) and potentially fix 
 * Consider whether moving to a new machine makes sense
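
For the drive-metric checks suggested above, a minimal sketch of how disk busy time could be sampled directly on a VM or on the hypervisor host, assuming a Linux system that exposes `/proc/diskstats`. The helper names (`parse_io_ticks`, `busy_fraction`, `sample`) are hypothetical, and the result is not guaranteed to match the metric behind the Grafana panel exactly; it only reads the kernel's "time spent doing I/Os" counter (field 13, in milliseconds) twice and compares the values.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O (field 13 of /proc/diskstats)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip empty or malformed lines
        ticks[fields[2]] = int(fields[12])
    return ticks


def busy_fraction(before, after, interval_ms):
    """Fraction of the interval each device spent busy with I/O."""
    return {
        dev: (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two readings interval_s seconds apart and return per-device busy fractions."""
    with open(path) as f:
        before = parse_io_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())
    return busy_fraction(before, after, interval_s * 1000)


if __name__ == "__main__":
    for dev, frac in sorted(sample().items()):
        print(f"{dev}: {frac:.1%} busy")
```

Running this on several VMs on qamaster and on the hypervisor itself would give a rough comparison of which devices are saturated; a value close to 100% on the host disk while the guests are mostly idle would point towards a problem with the underlying disk rather than with load from baremetal-support.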
