action #174679 (closed)

[alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S

Added by okurz 5 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-12-23T04:59:10.961Z&to=2024-12-23T05:04:37.311Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720
and other instances show significantly slow responses to I/O requests, in the range of 10 s; see

Screenshot_20241223_094515_baremetal_support_disk_io.png
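
For reference, a minimal sketch of where this metric comes from, assuming the alert is fed by a collector such as the Telegraf diskio input, which reports the kernel's cumulative "time spent doing I/Os" counter from /proc/diskstats; the device name and sampling interval below are illustrative:

import time

DEVICE = "vda"   # hypothetical: the VM's root disk
INTERVAL = 10    # seconds between samples

def io_time_ms(device: str) -> int:
    """Cumulative milliseconds `device` spent doing I/O.

    Column 13 of /proc/diskstats (index 12 after splitting) is the
    "time spent doing I/Os (ms)" counter from the kernel's iostats.
    """
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError(f"device {device!r} not found")

# Busy fraction over the window: how much of the wall-clock interval
# the device spent servicing requests; sustained values near 100%
# are what this kind of alert reacts to.
before = io_time_ms(DEVICE)
time.sleep(INTERVAL)
after = io_time_ms(DEVICE)
print(f"{DEVICE}: busy {(after - before) / (INTERVAL * 1000):.0%}")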

Acceptance criteria

  • AC1: It is known why I/O increased
  • AC2: I/O does not continue to increase steadily
  • AC3: There are no more disk I/O alerts on baremetal-support

Rollback actions

Suggestions



Actions #1

Updated by okurz 5 months ago

  • Description updated (diff)
  • Priority changed from High to Normal
Actions #2

Updated by robert.richardson 5 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by gpuliti 5 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 4 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S
Actions #5

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #6

Updated by okurz 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by nicksinger 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

baremetal-support: https://monitor.qa.suse.de/alerting/grafana/edc76c708ebaca9e5a5c8bb98ebe10752004c1b6/view?tab=history
monitor: https://monitor.qa.suse.de/alerting/grafana/05547173ebf57a7bf57c3426f8f46d7398395863/view?tab=history
both are VMs on qamaster with ext4 as root-fs.

Other VMs on qamaster but with btrfs:
tumblesle: https://monitor.qa.suse.de/alerting/grafana/88905688b5e3b380afd0fb9bc16c9aa53c659374/view?tab=history
schort-server: https://monitor.qa.suse.de/alerting/grafana/1ae475e1e8801d59bc6f65b2419e6345e17c4308/view?tab=history

And qamaster itself as hypervisor:
qamaster: https://monitor.qa.suse.de/alerting/grafana/6a0a7742feaf10b0ba18cade30082eaf88a832a6/view?tab=history

The disk containing the VM images:
https://monitor.qa.suse.de/d/GDqamaster/dashboard-for-qamaster?orgId=1&from=now-6M&to=now&timezone=browser&var-datasource=000000001&viewPanel=panel-56720

So it looks like we have a lot of general errors, but our alert definitions catch that. The general performance of qamaster's disk is in the range of several hundred ms (mean 250 ms), which is not great, but no other system has a bigger problem with it. So something is special about baremetal-support, or was until roughly 3 months ago, because since then there have been no spurious alerts on it either. I'm not sure yet how to approach this, but I have read about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.
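
A minimal sketch of what that conversion could look like, assuming qemu-img is available on the hypervisor and the VM is shut down during the copy; the image paths are hypothetical, not the actual paths on qamaster:

import subprocess

SRC = "/var/lib/libvirt/images/baremetal-support.qcow2"  # hypothetical path
DST = "/var/lib/libvirt/images/baremetal-support.raw"    # hypothetical path

# qemu-img convert reads the source format (-f) and writes the target
# format (-O); a raw image avoids qcow2 metadata lookups and
# allocate-on-write on the I/O path.
subprocess.run(["qemu-img", "convert", "-f", "qcow2", "-O", "raw", SRC, DST],
               check=True)
print(f"converted {SRC} -> {DST}")

The domain definition would then need its disk driver type switched from qcow2 to raw before starting the VM again.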

Actions #8

Updated by openqa_review 2 months ago

  • Due date set to 2025-04-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-7:

[…] I'm not sure yet how to approach this, but I have read about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.

Did we have a plan for this? I can't remember what was discussed in spoken conversation, and this ticket doesn't mention it. Maybe we can discuss it tomorrow.

Actions #10

Updated by livdywan about 2 months ago

  • Status changed from Feedback to Workable

@nicksinger plans to change the alert for virtual hosts
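
A sketch of how such an exception might be expressed, assuming the alert threshold can be chosen per host; the host set and threshold values are illustrative placeholders, not the actual alert configuration:

# VMs on qamaster (from #note-7); anything else gets the stricter
# bare-metal threshold. Values are illustrative, not real settings.
VM_HOSTS = {"baremetal-support", "monitor", "tumblesle", "schort-server"}

THRESHOLD_MS = {"vm": 500, "bare_metal": 100}

def io_time_threshold(host: str) -> int:
    """Pick the disk I/O time alert threshold for `host`."""
    return THRESHOLD_MS["vm" if host in VM_HOSTS else "bare_metal"]

assert io_time_threshold("baremetal-support") == 500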

Actions #11

Updated by livdywan about 2 months ago

  • Due date deleted (2025-04-10)
Actions #12

Updated by livdywan about 2 months ago

Resetting the due date as we had more pressing topics in the meantime.

Actions #13

Updated by nicksinger about 2 months ago

  • Status changed from Workable to Resolved

So I was looking again at how I could implement different alerts depending on which hypervisor a VM runs on. Trying to understand whether this is hypervisor-specific, I looked at https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=now-120d&to=now&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720 and realized that baremetal-support has had no bigger I/O spikes since 2025-01-22. So whatever happened on that day should have fixed the performance of that VM. I tried to find some reference in Progress but failed to find one; I only remember some RAID problems in the past which @okurz fixed and some hard reboots due to power issues in the room. I guess this leaves AC1 open, and we should investigate again if the machine causes problems in the future.
For now I removed the silence and will not implement a specific alert exception for this (class of) host.
