action #122842
closed openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids
coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M
Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M
Description
Summary
With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels, we cannot add an alert for the I/O time with the alerting backend we currently use.
Note that after migrating to "unified alerting" this alert was migrated as well and then explicitly removed again by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/819.
Acceptance criteria
- AC1: alerts for each disk on the webui with appropriate thresholds
Suggestions
- Take a look at our previous alerting rule: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/1c505df5e92420d0f266e7ea4b3a049aae892dd5/monitoring/grafana/webui.dashboard.json#L3757-3842
Updated by livdywan over 1 year ago
- Blocked by action #122845: Migrate our Grafana setup to "unified alerting" added
Updated by livdywan over 1 year ago
- Tags set to infra
- Target version set to Ready
Updated by okurz over 1 year ago
- Assignee set to okurz
Please only use "Blocked" with an assignee tracking the blocker.
Updated by okurz over 1 year ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Updated by okurz over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz over 1 year ago
Looking first at the dashboard of a generic machine as reference, https://monitor.qa.suse.de/alerting/grafana/disk_io_time_alert_backup-vm/view?returnTo=%2Fd%2FGDbackup-vm%2Fdashboard-for-backup-vm%3ForgId%3D1%26refresh%3D1m%26editPanel%3D56720%26tab%3Dalert , I find an alert condition for disk I/O time that says "WHEN avg() OF C IS ABOVE 20000", so we should be alerted if I/O requests are stuck for more than 20s. Sounds good. I don't know why the condition refers to "C", as I can't find a query A or B. It is evaluated "every 1m for 5m". The exact query is
SELECT non_negative_derivative(mean("io_time"), 1s) FROM "diskio" WHERE ("host" = 'backup-vm' AND ("name" <> 'nvme0n1' OR "name" = 'nvme1n1' OR "name" = 'sda' OR "name" = 'sdb')) AND $timeFilter GROUP BY time($__interval), "name" fill(null)
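To make the mechanics of that rule easier to follow, here is a minimal Python sketch of what a non-negative derivative over a counter such as "io_time" followed by an "average is above threshold" check roughly amounts to. It is not how InfluxDB or grafana implement it; the function names, the sample layout and the 1s unit handling are assumptions made for illustration only.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def non_negative_derivative(samples: List[Sample], unit_s: float = 1.0) -> List[Sample]:
    """Rate of change per `unit_s` seconds between consecutive samples.

    Negative rates (for example after a counter reset) are dropped, which is
    the idea behind InfluxQL's non_negative_derivative().
    """
    rates: List[Sample] = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip unordered or duplicate timestamps
        rate = (v1 - v0) / (t1 - t0) * unit_s
        if rate >= 0:
            rates.append((t1, rate))
    return rates


def average_is_above(rates: List[Sample], threshold: float = 20000.0) -> bool:
    """Hypothetical stand-in for the 'WHEN avg() ... IS ABOVE 20000' condition."""
    if not rates:
        return False  # nothing to average, treat as not breaching
    average = sum(rate for _, rate in rates) / len(rates)
    return average > threshold
```

The "every 1m for 5m" part is not modelled here: it would additionally require the condition to hold over the configured pending period before the alert fires.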
https://monitor.qa.suse.de/alerting/list?search=disk shows us that we have 32 rules for "Disk I/O time alert" covering likely all generic machines and workers.
I played around a bit in https://monitor.qa.suse.de/explore and then configured an alert with query
SELECT non_negative_derivative(mean("io_time"),1s) FROM "diskio" WHERE ("host" = 'openqa' AND "name" =~ /vd[a-z]$/) AND $timeFilter GROUP BY time($interval), "name"::tag, *
This excludes devices like "loop0", "vda1" and "vda2", so there is no redundant or irrelevant data. B is "Reduce" with Max on A in mode strict, and C is "Threshold" with input B and "is above" 20000. For the alert evaluation behaviour I selected "for 5m", and "OK" for both "no data" and "execution error". When I tried to save, grafana asked me to fill in the field "Evaluation group (interval)". I don't know what I need to put there. Following the examples on https://monitor.qa.suse.de/alerting/list, IIUC it should be folder "openQA" and "web UI: Disk I/O time alert" in the next field, but grafana says it cannot contain "/" or "\" characters.
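As a rough illustration of the device filter and the A, B, C chain described above, the following Python sketch applies the same regular expression to device names and then runs a "max, then threshold" check per device. The function names, the state labels and the input layout are hypothetical, and neither the strict reduce mode nor the "for 5m" pending period is modelled.

```python
import re
from typing import Dict, List

# Same pattern as in the query: names ending in "vd" plus one lowercase letter.
DEVICE_PATTERN = re.compile(r"vd[a-z]$")
THRESHOLD = 20000.0


def relevant_devices(names) -> List[str]:
    """Keep whole disks such as 'vda', drop 'loop0' and partitions like 'vda1'."""
    return [name for name in names if DEVICE_PATTERN.search(name)]


def evaluate(series: Dict[str, List[float]], threshold: float = THRESHOLD) -> Dict[str, str]:
    """Per device: reduce the values with max (B), compare with threshold (C)."""
    states: Dict[str, str] = {}
    for name in relevant_devices(series):
        values = series[name]
        if not values:
            states[name] = "ok"  # "no data" was configured to map to OK
            continue
        reduced = max(values)  # expression B: Reduce with Max
        states[name] = "alerting" if reduced > threshold else "ok"  # expression C
    return states
```

Because the pattern is only anchored at the end, a name such as "xvda" would match as well; that does not matter as long as no such device exists on the host.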
Anyway, I created the alert in the grafana web UI, exported all alerts to a YAML document, and included only the section about the new alert in the file dashboard-WebuiDb. I also adjusted the folder and the top-level key ordering, because the existing files apparently have all their keys ordered alphabetically while the grafana export does not. I doubt this will be a sustainable approach in the long term though.
Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/854
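The manual step of picking one rule out of the export and reordering its keys could be scripted along these lines. This is only a sketch: it assumes the export has already been loaded into a plain dict (for example with a YAML parser), and that it has a top-level "groups" list whose entries each carry a "rules" list with a "uid" per rule. The helper names are made up, and unlike the manual edit it sorts keys at every level, not only the top one.

```python
from typing import Any, Dict


def sort_keys(node: Any) -> Any:
    """Return a copy of `node` with all mapping keys in alphabetical order."""
    if isinstance(node, dict):
        return {key: sort_keys(node[key]) for key in sorted(node)}
    if isinstance(node, list):
        return [sort_keys(item) for item in node]
    return node


def extract_rule(export: Dict[str, Any], uid: str) -> Dict[str, Any]:
    """Keep only the rule with the given uid, then sort all keys.

    Assumed layout: {"groups": [{"rules": [{"uid": ...}, ...], ...}, ...], ...}
    """
    groups = []
    for group in export.get("groups", []):
        rules = [rule for rule in group.get("rules", []) if rule.get("uid") == uid]
        if rules:
            # Copy the group but keep only the matching rule(s).
            groups.append({**group, "rules": rules})
    return sort_keys({**export, "groups": groups})
```

The result would then be written back as YAML into the provisioning file; the order of rules inside a group is left untouched, only mapping keys are sorted.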
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
Updated by okurz over 1 year ago
- Due date deleted (2023-05-25)
- Status changed from Feedback to Resolved
I merged the MR and monitored the deployment and restart of grafana. The system journal for grafana on monitor.qa.suse.de did not show any related error messages. The alert showed up as https://monitor.qa.suse.de/alerting/grafana/d471ecd1-2d5c-418b-bd03-2b6b206fd27a/view as expected, marked as "provisioned" and replacing the formerly manually created one. So reusing the uuid of the manually created alert caused no problem. It may even be that the alert was properly replaced because I used the same uuid. Now https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158 shows a green heart for vda, but the other panels for vdb and the like do not show that. However, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158&editPanel=158&tab=alert# shows that we have 5 alert rules corresponding to the five storage devices. So the alerts look fine, they are just not linked to the monitoring panels. It might not be pretty, but I would argue that this is good enough.