action #174679 (closed)

[alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S

Added by okurz 5 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-12-23
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=2024-12-23T04:59:10.961Z&to=2024-12-23T05:04:37.311Z&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720
and other instances show significantly slow responses to I/O requests, in the range of 10 s; see

Screenshot_20241223_094515_baremetal_support_disk_io.png
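
For reference, a minimal sketch of where this metric comes from, assuming the alert is fed by a collector such as the Telegraf diskio input, which reports the kernel's cumulative "time spent doing I/Os" counter from /proc/diskstats; the device name and sampling interval below are illustrative:

import time

DEVICE = "vda"   # hypothetical: the VM's root disk
INTERVAL = 10    # seconds between samples

def io_time_ms(device: str) -> int:
    """Cumulative milliseconds `device` spent doing I/O.

    Column 13 of /proc/diskstats (index 12 after splitting) is the
    "time spent doing I/Os (ms)" counter from the kernel's iostats.
    """
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError(f"device {device!r} not found")

# Busy fraction over the window: how much of the wall-clock interval
# the device spent servicing requests; sustained values near 100%
# are what this kind of alert reacts to.
before = io_time_ms(DEVICE)
time.sleep(INTERVAL)
after = io_time_ms(DEVICE)
print(f"{DEVICE}: busy {(after - before) / (INTERVAL * 1000):.0%}")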

Acceptance criteria

  • AC1: It is known why I/O increased
  • AC2: I/O does not continue to increase steadily
  • AC3: There are no more disk I/O alerts on baremetal-support

Rollback actions

Suggestions



Actions #1

Updated by okurz 5 months ago

  • Description updated (diff)
  • Priority changed from High to Normal
Actions #2

Updated by robert.richardson 5 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by gpuliti 5 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 4 months ago

  • Subject changed from [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size: S to [alert][FIRING:1] baremetal-support (baremetal-support: Disk I/O time alert Generic disk_io_time_alert_baremetal-support generic) size:S
Actions #5

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #6

Updated by okurz 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by nicksinger 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

baremetal-support: https://monitor.qa.suse.de/alerting/grafana/edc76c708ebaca9e5a5c8bb98ebe10752004c1b6/view?tab=history
monitor: https://monitor.qa.suse.de/alerting/grafana/05547173ebf57a7bf57c3426f8f46d7398395863/view?tab=history
both are VMs on qamaster with ext4 as root-fs.

Other VMs on qamaster but with btrfs:
tumblesle: https://monitor.qa.suse.de/alerting/grafana/88905688b5e3b380afd0fb9bc16c9aa53c659374/view?tab=history
schort-server: https://monitor.qa.suse.de/alerting/grafana/1ae475e1e8801d59bc6f65b2419e6345e17c4308/view?tab=history

And qamaster itself as hypervisor:
qamaster: https://monitor.qa.suse.de/alerting/grafana/6a0a7742feaf10b0ba18cade30082eaf88a832a6/view?tab=history

The disk containing the VM images:
https://monitor.qa.suse.de/d/GDqamaster/dashboard-for-qamaster?orgId=1&from=now-6M&to=now&timezone=browser&var-datasource=000000001&viewPanel=panel-56720

So it looks like we have a lot of general errors, but our alert definitions catch that. The general performance of qamaster's disk is in the range of several hundred ms (mean 250 ms), which is not great, but no other system has a bigger problem with it. So something is special about baremetal-support, or was until roughly 3 months ago, because since then there have been no spurious alerts on it either. I'm not sure yet how to approach this, but I have read about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.
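
A minimal sketch of what that conversion could look like, assuming qemu-img is available on the hypervisor and the VM is shut down during the copy; the image paths are hypothetical, not the actual paths on qamaster:

import subprocess

SRC = "/var/lib/libvirt/images/baremetal-support.qcow2"  # hypothetical path
DST = "/var/lib/libvirt/images/baremetal-support.raw"    # hypothetical path

# qemu-img convert reads the source format (-f) and writes the target
# format (-O); a raw image avoids qcow2 metadata lookups and
# allocate-on-write on the I/O path.
subprocess.run(["qemu-img", "convert", "-f", "qcow2", "-O", "raw", SRC, DST],
               check=True)
print(f"converted {SRC} -> {DST}")

The domain definition would then need its disk driver type switched from qcow2 to raw before starting the VM again.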

Actions #8

Updated by openqa_review 2 months ago

  • Due date set to 2025-04-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-7:

[…] I'm not sure yet how to approach this, but I have read about slow performance of qcow2 images. Maybe we can just convert this single VM to a raw image.

Did we have a plan for this? I can't remember what was discussed in spoken conversation, and this ticket doesn't mention it. Maybe we can discuss it tomorrow.

Actions #10

Updated by livdywan about 2 months ago

  • Status changed from Feedback to Workable

@nicksinger plans to change the alert for virtual hosts
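
A sketch of how such an exception might be expressed, assuming the alert threshold can be chosen per host; the host set and threshold values are illustrative placeholders, not the actual alert configuration:

# VMs on qamaster (from #note-7); anything else gets the stricter
# bare-metal threshold. Values are illustrative, not real settings.
VM_HOSTS = {"baremetal-support", "monitor", "tumblesle", "schort-server"}

THRESHOLD_MS = {"vm": 500, "bare_metal": 100}

def io_time_threshold(host: str) -> int:
    """Pick the disk I/O time alert threshold for `host`."""
    return THRESHOLD_MS["vm" if host in VM_HOSTS else "bare_metal"]

assert io_time_threshold("baremetal-support") == 500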

Actions #11

Updated by livdywan about 2 months ago

  • Due date deleted (2025-04-10)
Actions #12

Updated by livdywan about 2 months ago

Resetting the due date as we had more pressing topics in the meantime.

Actions #13

Updated by nicksinger about 2 months ago

  • Status changed from Workable to Resolved

So I was looking again at how I could implement different alerts depending on which hypervisor a VM runs on. Trying to understand whether this is hypervisor-specific, I looked at https://monitor.qa.suse.de/d/GDbaremetal-support/dashboard-for-baremetal-support?orgId=1&from=now-120d&to=now&timezone=browser&var-datasource=000000001&refresh=1m&viewPanel=panel-56720 and realized that baremetal-support has had no bigger I/O spikes since 2025-01-22. So whatever happened on that day should have fixed the performance of that VM. I tried to find some reference in Progress but failed to find one; I only remember some RAID problems in the past which @okurz fixed and some hard reboots due to power issues in the room. I guess this leaves AC1 open, and we should investigate again if the machine causes problems in the future.
For now I removed the silence and will not implement a specific alert exception for this (class of) host.
