action #122842

closed

openQA Project (public) - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Added by livdywan almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Start date:
2023-01-09
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Summary

With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels we cannot add an alert for the IO time with the current alerting backend we use.

Note that after migrating to "unified alerting" this alert was migrated as well and then explicitly removed again by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/819.

Acceptance criteria

  • AC1: Alerts exist for each disk on the webui with corresponding thresholds

Suggestions


Related issues 1 (0 open, 1 closed)

Blocked by openQA Infrastructure (public) - action #122845: Migrate our Grafana setup to "unified alerting" | Resolved | nicksinger | 2023-01-09

Actions #1

Updated by livdywan almost 2 years ago

  • Blocked by action #122845: Migrate our Grafana setup to "unified alerting" added
Actions #2

Updated by livdywan almost 2 years ago

  • Status changed from Workable to Blocked
Actions #3

Updated by livdywan almost 2 years ago

  • Tags set to infra
  • Target version set to Ready
Actions #4

Updated by okurz almost 2 years ago

  • Assignee set to okurz

Please only use "Blocked" with an assignee tracking the blocker.

Actions #5

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)
Actions #6

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #7

Updated by okurz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #8

Updated by okurz over 1 year ago

Looking first at the dashboard of a generic machine as reference, https://monitor.qa.suse.de/alerting/grafana/disk_io_time_alert_backup-vm/view?returnTo=%2Fd%2FGDbackup-vm%2Fdashboard-for-backup-vm%3ForgId%3D1%26refresh%3D1m%26editPanel%3D56720%26tab%3Dalert, I find an alert condition that says "WHEN avg() OF C IS ABOVE 20000" for disk I/O time, so if I/O requests are stuck for +20s we should be alerted. Sounds good. Why "C" I don't know. I don't find a query A or B. It's evaluated "every 1m for 5m". The exact query is

SELECT non_negative_derivative(mean("io_time"), 1s) FROM "diskio" WHERE ("host" = 'backup-vm' AND ("name" <> 'nvme0n1' OR "name" = 'nvme1n1' OR "name" = 'sda' OR "name" = 'sdb')) AND $timeFilter GROUP BY time($__interval), "name" fill(null)

https://monitor.qa.suse.de/alerting/list?search=disk shows us that we have 32 rules for "Disk I/O time alert", likely covering all generic machines and workers.
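For readers unfamiliar with non_negative_derivative(..., 1s) in the query above, here is a minimal Python sketch of the idea: take the rate of change between consecutive points, normalize it to one second, and drop negative rates (e.g. after a counter reset). The function name, the sample data and the point layout are made up for illustration. This is not how InfluxDB implements it.

```python
def non_negative_derivative(points, unit_seconds=1.0):
    """Rate of change per `unit_seconds` between consecutive points.

    `points` is a list of (timestamp_in_seconds, value) tuples sorted by time.
    Negative rates are dropped instead of being reported.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t1 == t0:
            continue  # skip duplicate timestamps to avoid division by zero
        rate = (v1 - v0) / (t1 - t0) * unit_seconds
        if rate >= 0:
            rates.append((t1, rate))
    return rates


# Hypothetical io_time samples taken once per minute; the third sample
# simulates a counter reset and therefore yields no output point.
samples = [(0, 1000), (60, 1600), (120, 100), (180, 700)]
rates = non_negative_derivative(samples)
```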

I played around a bit in https://monitor.qa.suse.de/explore and then configured an alert with query

SELECT non_negative_derivative(mean("io_time"),1s) FROM "diskio" WHERE  ("host" = 'openqa' AND "name" =~ /vd[a-z]$/)  AND $timeFilter GROUP BY time($interval), "name"::tag, *

This excludes devices like "loop0", "vda1", "vda2", so there is no redundant or irrelevant data. B is "Reduce" with Max on A, mode strict. C is "Threshold" with input B and "is above" 20000. For the alert evaluation behaviour I selected "for 5m", and for both "no data" and "execution error" I selected "OK". When I tried to save, grafana asked me to fill the field "Evaluation group (interval)". I don't know what I need to put there. Following examples on https://monitor.qa.suse.de/alerting/list, IIUC it should be folder "openQA" and the next field "web UI: Disk I/O time alert", but grafana says it cannot contain "/" or "\" characters.
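The device filter and the B/C expressions described above can be illustrated with a small Python sketch. It assumes the InfluxQL regex behaves like an unanchored search, as Python's re.search does, and it evaluates each device series on its own. The function names and sample values are hypothetical. The "for 5m" pending period and the Reduce mode's handling of non-numeric values are not modelled.

```python
import re

# Same pattern as in the query: "vd" plus one letter at the end of the name.
DEVICE_PATTERN = re.compile(r"vd[a-z]$")


def select_devices(names):
    """Keep whole virtual disks like 'vda', drop partitions and loop devices."""
    return [name for name in names if DEVICE_PATTERN.search(name)]


def evaluate(series_by_device, threshold=20000):
    """Per device: B = max of the series (Reduce), C = B is above threshold."""
    firing = {}
    for name, values in series_by_device.items():
        if not values:
            firing[name] = False  # "no data" is configured as "OK"
            continue
        firing[name] = max(values) > threshold
    return firing


devices = select_devices(["loop0", "vda", "vda1", "vda2", "vdb"])
state = evaluate({"vda": [120.0, 300.5], "vdb": [150.0, 25000.0], "vdc": []})
```

Evaluating each series separately matches what one would expect from a query grouped by "name": one alert instance per storage device.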

Anyway, I created the alert in the grafana webUI and then exported all alerts to a YAML document. I included only the section about the new alert in the file dashboard-WebuiDb and adjusted the folder and the key ordering on the top level, because apparently the existing files all have their keys ordered alphabetically but the export from grafana does not. I doubt this will be a sustainable approach in the long term though.

Created
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/854

Actions #9

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz over 1 year ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by okurz over 1 year ago

  • Due date deleted (2023-05-25)
  • Status changed from Feedback to Resolved

I merged the MR and monitored the deployment and restart of grafana. The system journal for grafana on monitor.qa.suse.de did not show any related error messages. The alert showed up as https://monitor.qa.suse.de/alerting/grafana/d471ecd1-2d5c-418b-bd03-2b6b206fd27a/view as expected, marked as "provisioned" and replacing the formerly manually created one. So there was also no problem with reusing the uuid of the manually created alert. Maybe the alert was even properly replaced precisely because I used the same uuid.

Now https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158 shows a green heart for vda, but the other panels for vdb and the like do not show that. However, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=158&editPanel=158&tab=alert# shows that we have 5 alert rules corresponding to the five storage devices. So the alerts look fine, they are just not linked to the monitoring panels. That might not be pretty, but I argue that this is good enough.
