action #174679

open

[alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S

Added by okurz 3 months ago. Updated 7 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-23
Due date: 2025-04-10 (Due in 7 days)
% Done: 0%
Estimated time:
Description

Observation

https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-12-23T04:59:10.961Z&to=2024-12-23T05:04:37.311Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720
and other instances show very slow responses to I/O requests, in the range of 10 s, see

Screenshot_20241223_094515_baremetal_support_disk_io.png

Acceptance criteria

  • AC1: It is known why I/O increased
  • AC2: I/O does not continue to increase steadily
  • AC3: There is no alert anymore about disk I/O on baremetal-support
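
One possible way to check AC2 against exported panel data is a least-squares slope over the I/O time samples. This is only a minimal sketch: the function names, the threshold, and the sample values are made up for illustration and are not taken from the dashboard or the alert definition.

```python
def slope_per_sample(values):
    """Least-squares slope of the values over their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def increases_steadily(values, min_slope=0.0):
    """True if the fitted trend rises faster than min_slope per sample.

    min_slope is a hypothetical threshold, not part of any existing alert.
    """
    return slope_per_sample(values) > min_slope


if __name__ == "__main__":
    # Made-up I/O time samples in ms, purely for illustration.
    samples = [200.0, 230.0, 250.0, 290.0, 310.0]
    print(increases_steadily(samples))
```

A positive slope only indicates an upward trend within the chosen window, so the window length would need to match the time range in which the increase was observed.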

Rollback actions

Suggestions


Files

Actions #1

Updated by okurz 3 months ago

  • Description updated (diff)
  • Priority changed from High to Normal
Actions #2

Updated by robert.richardson 3 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by gpuliti 3 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 2 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S
Actions #5

Updated by okurz 2 months ago

  • Target version changed from Ready to Tools - Next
Actions #6

Updated by okurz 30 days ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by nicksinger 8 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

baremetal-support: https://monitor.qa.suse.de/alerting/grafana/edc76c708ebaca9e5a5c8bb98ebe10752004c1b6/view?tab=history
monitor: https://monitor.qa.suse.de/alerting/grafana/05547173ebf57a7bf57c3426f8f46d7398395863/view?tab=history
Both are VMs on qamaster with ext4 as the root filesystem.

Other VMs on qamaster but with btrfs:
tumblesle: https://monitor.qa.suse.de/alerting/grafana/88905688b5e3b380afd0fb9bc16c9aa53c659374/view?tab=history
schort-server: https://monitor.qa.suse.de/alerting/grafana/1ae475e1e8801d59bc6f65b2419e6345e17c4308/view?tab=history

And qamaster itself as hypervisor:
qamaster: https://monitor.qa.suse.de/alerting/grafana/6a0a7742feaf10b0ba18cade30082eaf88a832a6/view?tab=history

The disk containing the VM images:
https://monitor.qa.suse.de/d/GDqamaster/dashboard-for-qamaster?orgId=1&from=now-6M&to=now&timezone=browser&var-datasource=000000001&viewPanel=panel-56720

So it looks like we have a lot of general errors but our alert definitions catch that. The general performance of qamaster's disk seems to be in the range of several hundred ms (mean 250 ms), which is not great, but no other system has a bigger problem with that. So something is special about baremetal-support, or was ~3 months ago and earlier, because since then there have been no more wrong alerts on it either. I'm not sure yet how to approach this, but I read something about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.
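
If we go for the conversion, it could look roughly like the sketch below, which only builds and prints the `qemu-img convert` call instead of running it. The image paths are placeholders, not the real location of the baremetal-support image on qamaster, and the helper name is made up. The VM would need to be shut down first, and its disk definition would have to point at the new raw file afterwards.

```python
import shlex


def build_convert_command(src, dst):
    """Build a qemu-img call that converts a qcow2 image to a raw image.

    -p shows progress, -f is the source format, -O the output format.
    """
    return ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", str(src), str(dst)]


if __name__ == "__main__":
    # Placeholder paths (assumption), replace with the real image location.
    cmd = build_convert_command(
        "/path/to/baremetal-support.qcow2",
        "/path/to/baremetal-support.raw",
    )
    # Only print the command so it can be reviewed before anyone runs it.
    print(shlex.join(cmd))
```

Keeping the original qcow2 file until the VM has run fine on the raw image would give us an easy rollback.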

Actions #8

Updated by openqa_review 7 days ago

  • Due date set to 2025-04-10

Setting due date based on mean cycle time of SUSE QE Tools
