action #174679
open[alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S
0%
Description
Observation
https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-12-23T04:59:10.961Z&to=2024-12-23T05:04:37.311Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720
and other instances show significantly slow responses to I/O requests, in the range of 10s, see
Acceptance criteria
- AC1: It is known why I/O increased
- AC2: I/O does not continue to increase steadily
- AC3: There is no alert anymore about disk I/O on baremetal-support
Rollback actions
- Remove silence from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
alertname=baremetal-support: Disk I/O time alert
Suggestions
- Look into the concerning increase of disk I/O time in https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-11-14T12:20:35.343Z&to=2025-01-14T13:25:40.428Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720
- Check drive metrics for other VMs on qamaster
- Check the disk(s) for problems (on the hypervisor host) and potentially fix
- Consider if moving to a new machine makes sense
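To cross-check the dashboard numbers directly on a VM or on the hypervisor, something like the following could be used. This is only a minimal sketch: it assumes a Linux host exposing `/proc/diskstats` in the usual kernel layout, the function names are made up for illustration, and the value it computes (mean time per completed request between two samples) is not necessarily the same metric the Grafana alert is built on.

```python
from typing import Dict

Counters = Dict[str, int]


def parse_diskstats(text: str) -> Dict[str, Counters]:
    """Parse the content of /proc/diskstats into per-device counters."""
    stats: Dict[str, Counters] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip empty or malformed lines
        stats[fields[2]] = {
            "reads": int(fields[3]),      # reads completed
            "read_ms": int(fields[6]),    # time spent reading (ms)
            "writes": int(fields[7]),     # writes completed
            "write_ms": int(fields[10]),  # time spent writing (ms)
            "io_ms": int(fields[12]),     # time spent doing I/O (ms)
        }
    return stats


def mean_io_latency_ms(before: Counters, after: Counters) -> float:
    """Mean time per completed request between two samples, in ms."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios <= 0:
        return 0.0  # no completed requests in the interval
    spent = (after["read_ms"] - before["read_ms"]) + (
        after["write_ms"] - before["write_ms"]
    )
    return spent / ios
```

Reading `/proc/diskstats` twice a few seconds apart and passing the two entries for one device (for example `vda`, a hypothetical device name) to `mean_io_latency_ms` gives a rough latency figure that can be compared between the VMs and qamaster.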
Updated by robert.richardson 3 months ago
- Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 2 months ago
- Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S
Updated by nicksinger 8 days ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
baremetal-support: https://monitor.qa.suse.de/alerting/grafana/edc76c708ebaca9e5a5c8bb98ebe10752004c1b6/view?tab=history
monitor: https://monitor.qa.suse.de/alerting/grafana/05547173ebf57a7bf57c3426f8f46d7398395863/view?tab=history
Both are VMs on qamaster with ext4 as root-fs.
Other VMs on qamaster but with btrfs:
tumblesle: https://monitor.qa.suse.de/alerting/grafana/88905688b5e3b380afd0fb9bc16c9aa53c659374/view?tab=history
schort-server: https://monitor.qa.suse.de/alerting/grafana/1ae475e1e8801d59bc6f65b2419e6345e17c4308/view?tab=history
And qamaster itself as hypervisor:
qamaster: https://monitor.qa.suse.de/alerting/grafana/6a0a7742feaf10b0ba18cade30082eaf88a832a6/view?tab=history
The disk containing the VM images:
https://monitor.qa.suse.de/d/GDqamaster/dashboard-for-qamaster?orgId=1&from=now-6M&to=now&timezone=browser&var-datasource=000000001&viewPanel=panel-56720
So it looks like we have a lot of general errors, but our alert definitions catch that. The general performance of qamaster's disk seems to be in the range of multiple hundreds of ms (mean 250 ms), which is not great, but no other system has a bigger problem with that. So something is special about baremetal-support, or was ~3 months ago and longer, because since then there have been no wrong alerts on it anymore either. I'm not sure yet how to approach this, but I read something about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.
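For reference, the qcow2-to-raw conversion could look roughly like this. It is a sketch, not a tested procedure: the image paths are placeholders, the helper name is made up, and it assumes `qemu-img` is available on the hypervisor and that the VM is shut down while the image is converted. The VM definition would also need to be pointed at the new raw image afterwards.

```python
import subprocess
from typing import List


def qemu_img_convert_cmd(src: str, dst: str) -> List[str]:
    """Build a qemu-img call that converts a qcow2 image into a raw image."""
    # -p shows progress, -f is the source format, -O the output format
    return ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", src, dst]


def convert_image(src: str, dst: str, dry_run: bool = True) -> List[str]:
    """Return the command; only execute it when dry_run is False."""
    cmd = qemu_img_convert_cmd(src, dst)
    if not dry_run:
        # Assumption: the VM using src is shut down at this point
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `convert_image("/path/to/vm.qcow2", "/path/to/vm.raw")` with the default `dry_run=True` only returns the command, so it can be reviewed before anything is touched. The original qcow2 file stays in place, which keeps a rollback simple.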
Updated by openqa_review 7 days ago
- Due date set to 2025-04-10
Setting due date based on mean cycle time of SUSE QE Tools