action #122842

closed

openQA Project (public) - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Added by livdywan over 2 years ago. Updated almost 2 years ago.

Status: Resolved

Priority: Normal

Assignee: okurz

Target version: openQA Project (public) - Ready

Start date: 2023-01-09

Tags: infra

Description

Summary

With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels we cannot add an alert for the IO time with the current alerting backend we use.

Note that after migrating to "unified alerting" this alert was migrated as well and then explicitly removed again by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/819.

Acceptance criteria

AC1: Alerts exist for each disk on the webui with appropriate thresholds

Suggestions

Take a look at our previous alerting rule: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/1c505df5e92420d0f266e7ea4b3a049aae892dd5/monitoring/grafana/webui.dashboard.json#L3757-3842

Related issues 1 (0 open — 1 closed)


Updated by livdywan over 2 years ago

Blocked by action #122845: Migrate our Grafana setup to "unified alerting" added


Updated by livdywan over 2 years ago

Status changed from Workable to Blocked


Updated by livdywan over 2 years ago

Tags set to infra
Target version set to Ready


Updated by okurz over 2 years ago

Assignee set to okurz

please only use "Blocked" with an assignee tracking the blocker.


Updated by okurz about 2 years ago

Status changed from Blocked to Workable
Assignee deleted (~~okurz~~)


Updated by mkittler about 2 years ago

Description updated (diff)


Updated by okurz almost 2 years ago

Status changed from Workable to In Progress
Assignee set to okurz


Updated by okurz almost 2 years ago

Looking first at the dashboard of a generic machine as a reference, https://monitor.qa.suse.de/alerting/grafana/disk_io_time_alert_backup-vm/view?returnTo=%2Fd%2FGDbackup-vm%2Fdashboard-for-backup-vm%3ForgId%3D1%26refresh%3D1m%26editPanel%3D56720%26tab%3Dalert, I find an alert condition for disk I/O time that says "WHEN avg() OF C IS ABOVE 20000", so we should be alerted if I/O requests are stuck for more than 20s. Sounds good. I don't know why it is "C"; I can't find a query A or B. The rule is evaluated "every 1m for 5m". The exact query is

SELECT non_negative_derivative(mean("io_time"), 1s) FROM "diskio" WHERE ("host" = 'backup-vm' AND ("name" <> 'nvme0n1' OR "name" = 'nvme1n1' OR "name" = 'sda' OR "name" = 'sdb')) AND $timeFilter GROUP BY time($__interval), "name" fill(null)

https://monitor.qa.suse.de/alerting/list?search=disk shows that we have 32 rules for "Disk I/O time alert", likely covering all generic machines and workers.
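To make the existing alert condition easier to reason about, here is a minimal Python sketch of what the query and condition above compute: a per-second rate of the "io_time" counter with negative rates skipped, followed by an "avg() is above 20000" check. The function names and the sample layout are made up for illustration, and details such as interval grouping and fill(null) are not modeled.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def non_negative_derivative(samples: List[Sample], unit: float = 1.0) -> List[Sample]:
    """Rate of change per `unit` seconds between consecutive samples.

    Negative rates (for example after a counter reset) are skipped,
    which is the behaviour this sketch assumes for the InfluxQL function.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # ignore out-of-order or duplicate timestamps
        rate = (v1 - v0) / (t1 - t0) * unit
        if rate >= 0:
            rates.append((t1, rate))
    return rates


def avg_is_above(samples: List[Sample], threshold: float = 20000.0) -> bool:
    """Hypothetical stand-in for "WHEN avg() OF C IS ABOVE 20000"."""
    rates = [rate for _, rate in non_negative_derivative(samples)]
    return bool(rates) and sum(rates) / len(rates) > threshold


# One sample per minute; the last sample simulates a counter reset.
samples = [(0, 0), (60, 600000), (120, 3000000), (180, 100)]
rates = non_negative_derivative(samples)
fired = avg_is_above(samples)
```

The "every 1m for 5m" part is not covered here; it only controls how often the condition is evaluated and how long it has to hold before the alert fires.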

I played around a bit in https://monitor.qa.suse.de/explore and then configured an alert with query

SELECT non_negative_derivative(mean("io_time"),1s) FROM "diskio" WHERE  ("host" = 'openqa' AND "name" =~ /vd[a-z]$/)  AND $timeFilter GROUP BY time($interval), "name"::tag, *

This excludes devices like "loop0", "vda1", "vda2", so there is no redundant or irrelevant data. B is "Reduce" with Max on A, mode strict; C is "Threshold" with input B and "is above" 20000. For the alert evaluation behaviour I selected "for 5m", and "OK" for both "no data" and "execution error". When I tried to save, grafana asked me to fill in the field "Evaluation group (interval)". I don't know what I need to put there. Following the examples on https://monitor.qa.suse.de/alerting/list, IIUC it should be folder "openQA" and the next field "web UI: Disk I/O time alert", but grafana says it cannot contain "/" or "\" characters.
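The device filter and the A, B, C chain described above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names and the input layout (one list of values per device name) are hypothetical, and grafana's "strict" reduce mode is not modeled. Series without values are reported as not alerting, matching the "no data" = "OK" choice.

```python
import re
from typing import Dict, Iterable, List

# Same pattern as in the query: names ending in "vd" plus one letter.
DEVICE_PATTERN = re.compile(r"vd[a-z]$")


def select_devices(names: Iterable[str]) -> List[str]:
    """Keep whole virtio disks (vda, vdb, ...), drop partitions and loop devices."""
    return [name for name in names if DEVICE_PATTERN.search(name)]


def evaluate(series: Dict[str, List[float]], threshold: float = 20000.0) -> Dict[str, bool]:
    """A: one series per device, B: reduce with max, C: threshold "is above"."""
    states = {}
    for name in select_devices(series):
        values = series[name]
        if not values:
            states[name] = False  # "no data" is configured as "OK" in the ticket
            continue
        reduced = max(values)                # B: Reduce (Max) on A
        states[name] = reduced > threshold   # C: Threshold on B
    return states


example = {
    "vda": [100.0, 25000.0],
    "vdb": [5.0, 10.0],
    "vdc": [],
    "vda1": [99999.0],   # partition, filtered out
    "loop0": [99999.0],  # loop device, filtered out
}
states = evaluate(example)
```

Because the result is keyed by device name, a single rule yields one state per disk, which is the per-disk alerting that AC1 asks for.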

Anyway, I created the alert in the grafana web UI, then exported all alerts to a YAML document and included only the section for the new alert in the file dashboard-WebuiDb. I also adjusted the folder and the top-level key ordering, because the existing files apparently have all their keys ordered alphabetically while the grafana export does not. I doubt this is a sustainable long-term approach though.
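The manual steps above (keep only the new rule, adjust the folder, sort the top-level keys) could in principle be scripted. The following Python sketch works on an already parsed export and assumes a layout with top-level "apiVersion" and "groups" keys, where each group has "folder" and "rules" and each rule has a "title". That layout, the function name and the sample data are assumptions for illustration, not taken from the merge request. Reading and writing the YAML itself would need a YAML library and is left out.

```python
from typing import Any, Dict, Optional


def extract_rule(export: Dict[str, Any], title: str,
                 folder: Optional[str] = None) -> Dict[str, Any]:
    """Return a copy of the export that only contains the rule with `title`.

    Groups without a matching rule are dropped, the folder is optionally
    overridden, and the top-level keys are sorted alphabetically.
    """
    groups = []
    for group in export.get("groups", []):
        rules = [rule for rule in group.get("rules", []) if rule.get("title") == title]
        if not rules:
            continue
        trimmed = dict(group, rules=rules)  # shallow copy, input stays untouched
        if folder is not None:
            trimmed["folder"] = folder
        groups.append(trimmed)
    result = dict(export, groups=groups)
    return dict(sorted(result.items()))  # alphabetical top-level key order


# Hypothetical, heavily shortened export for illustration only.
export = {
    "groups": [
        {"orgId": 1, "name": "g1", "folder": "General",
         "rules": [{"uid": "a", "title": "keep"}, {"uid": "b", "title": "drop"}]},
        {"orgId": 1, "name": "g2", "folder": "General",
         "rules": [{"uid": "c", "title": "drop"}]},
    ],
    "apiVersion": 1,
}
trimmed_export = extract_rule(export, "keep", folder="openQA")
```

Keeping the rule's uid unchanged in such a step is what allows the provisioned rule to take the place of the manually created one.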

Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/854


Updated by openqa_review almost 2 years ago

Due date set to 2023-05-25

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz almost 2 years ago

Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/854

Updated by okurz almost 2 years ago

Due date deleted (~~2023-05-25~~)
Status changed from Feedback to Resolved

I merged the MR and monitored the deployment and the restart of grafana. The system journal for grafana on monitor.qa.suse.de did not show any related error messages. The alert showed up as https://monitor.qa.suse.de/alerting/grafana/d471ecd1-2d5c-418b-bd03-2b6b206fd27a/view as expected, marked as "provisioned" and replacing the formerly manually created one. So reusing the uuid of the manually created alert caused no problem. Maybe the alert was even replaced properly precisely because I used the same uuid.

Now https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158 shows a green heart for vda, but the other panels for vdb and the like do not. However, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158&editPanel=158&tab=alert# shows that we have 5 alert rules corresponding to the five storage devices. So the alerts look fine; they are just not linked to the monitoring panels. It might not be pretty, but I would argue that this is good enough.
