As reference I first looked at the alert of a generic machine: https://monitor.qa.suse.de/alerting/grafana/disk_io_time_alert_backup-vm/view?returnTo=%2Fd%2FGDbackup-vm%2Fdashboard-for-backup-vm%3ForgId%3D1%26refresh%3D1m%26editPanel%3D56720%26tab%3Dalert
There I find an alert condition for disk I/O time that says "WHEN avg() OF C IS ABOVE 20000", so if I/O requests are stuck for more than 20s we should be alerted. Sounds good. Why "C" I don't know, as I don't find a query A or B. It's evaluated "every 1m for 5m". The exact query is
SELECT non_negative_derivative(mean("io_time"), 1s) FROM "diskio" WHERE ("host" = 'backup-vm' AND ("name" <> 'nvme0n1' OR "name" = 'nvme1n1' OR "name" = 'sda' OR "name" = 'sdb')) AND $timeFilter GROUP BY time($__interval), "name" fill(null)
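To make clearer what that query feeds into the alert, here is a minimal Python sketch of what non_negative_derivative(..., 1s) does as far as I understand it: the rate of change between consecutive points, normalized to one second, with negative rates dropped. The function and the sample values are made up for illustration and are not taken from the real data.

```python
def non_negative_derivative(points, unit_s=1.0):
    """Rate of change per `unit_s` seconds between consecutive (time_s, value) points.

    Negative rates (e.g. after a counter reset) are dropped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rate = (v1 - v0) / (t1 - t0) * unit_s
        if rate >= 0:
            rates.append((t1, rate))
    return rates


# Hypothetical mean("io_time") values per 60s bucket; the drop at t=120
# simulates a counter reset and is therefore skipped in the result.
samples = [(0, 1000), (60, 1600), (120, 100), (180, 700)]
print(non_negative_derivative(samples))
```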
https://monitor.qa.suse.de/alerting/list?search=disk shows us that we have 32 rules for "Disk I/O time alert" covering likely all generic machines and workers.
I played around a bit in https://monitor.qa.suse.de/explore and then configured an alert with this query as A:
SELECT non_negative_derivative(mean("io_time"),1s) FROM "diskio" WHERE ("host" = 'openqa' AND "name" =~ /vd[a-z]$/) AND $timeFilter GROUP BY time($interval), "name"::tag, *
This excludes devices like "loop0", "vda1", "vda2", so there is no redundant or irrelevant data. B is "Reduce" with Max on A, mode strict. C is "Threshold" with input B and "is above" 20000. For the alert evaluation behaviour I selected "for 5m", and for both "no data" and "execution error" I selected "OK". When I tried to save, Grafana asked me to fill the field "Evaluation group (interval)". I don't know what I need to put there. Following the examples on https://monitor.qa.suse.de/alerting/list, IIUC it should be folder "openQA" and in the next field "web UI: Disk I/O time alert", but Grafana says it cannot contain "/" or "\" characters.
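As a rough illustration of the A, B, C chain described above, a small Python sketch: filter the devices with the same regex as in the query, reduce each series with max, then compare against the threshold. The device values are hypothetical, and the sketch does not model the "for 5m" pending period or how strict mode treats non-numeric values.

```python
import re

# Same pattern as in the query: "vd" plus one letter at the end of the name.
DEVICE_RE = re.compile(r"vd[a-z]$")


def is_alerting(series_by_device, threshold=20000):
    """A: keep matching devices, B: reduce with max, C: compare to threshold."""
    selected = {
        name: values
        for name, values in series_by_device.items()
        if DEVICE_RE.search(name)
    }
    reduced = {name: max(values) for name, values in selected.items()}
    return {name: value > threshold for name, value in reduced.items()}


# Hypothetical per-device values, not real measurements.
series = {
    "vda": [120.0, 25000.0, 300.0],
    "vdb": [10.0, 15.0, 12.0],
    "vda1": [99999.0],   # partition, excluded by the regex
    "loop0": [99999.0],  # loop device, excluded by the regex
}
print(is_alerting(series))
```

With these made-up values only "vda" and "vdb" are evaluated, and only "vda" would be above the threshold.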
Anyway, I created the alert in the Grafana web UI, then exported all alerts to a YAML document and included only the section about the new alert in the file dashboard-WebuiDb. I adjusted the folder and the key ordering on the top level because apparently the files so far have all their keys alphabetically ordered but the export from Grafana does not. I doubt this will be a long-term sustainable approach though.
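The manual reordering could probably be scripted. A minimal sketch, assuming the exported document has already been loaded into a Python dict (the YAML loading itself is left out) and that only the top-level keys need sorting; the key names in the example are hypothetical and only stand in for whatever the export contains.

```python
def sort_top_level_keys(document):
    """Return a copy of the mapping with only its top-level keys sorted alphabetically."""
    return {key: document[key] for key in sorted(document)}


# Hypothetical stand-in for an exported alert document; nested key order is kept.
exported = {
    "groups": [{"name": "example", "folder": "openQA"}],
    "apiVersion": 1,
}
print(list(sort_top_level_keys(exported)))
```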
Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/854