action #122845

closed

openQA Project - coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids

coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Migrate our Grafana setup to "unified alerting"

Added by livdywan over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-01-09
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Summary

With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels we cannot add an alert for the IO time with the current alerting backend we use. This should be possible with unified alerting: https://grafana.com/blog/2021/06/14/the-new-unified-alerting-system-for-grafana-everything-you-need-to-know/

Acceptance criteria

  • AC1: alerts work at least the way they did before
  • AC2: the Grafana instance runs stably with unified alerting

Suggestions

  • Find out how to migrate to the new system, automatically or manually
  • Try it out with an official test instance of Grafana available from their website
  • Test with a container
  • Confirm what we end up with, e.g. new JSON or a different layout
  • Keep in mind this is the default for Grafana 10 and our current setup may not be supportable long-term

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #125303: prevent confusing "no data" alerts size:M | Resolved | nicksinger | 2023-03-02 | 2023-04-07
Blocks openQA Infrastructure - action #122842: Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M | Resolved | okurz | 2023-01-09
Blocks openQA Infrastructure - action #122848: Configure grouped alerts in Grafana correctly size:M | Resolved | okurz | 2023-01-09

Actions #1

Updated by livdywan over 1 year ago

  • Blocks action #122842: Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M added
Actions #2

Updated by livdywan over 1 year ago

  • Blocks action #122848: Configure grouped alerts in Grafana correctly size:M added
Actions #3

Updated by livdywan over 1 year ago

  • Tags set to infra
Actions #4

Updated by livdywan over 1 year ago

  • Target version set to Ready
Actions #5

Updated by robert.richardson over 1 year ago

  • Assignee set to robert.richardson
Actions #6

Updated by robert.richardson over 1 year ago

  • Status changed from Workable to In Progress
Actions #7

Updated by openqa_review over 1 year ago

  • Due date set to 2023-01-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by livdywan over 1 year ago

We talked about it in the unblock. Robert is trying to set up Grafana locally, since we don't want to just upgrade in production. We could have that setup on a staging machine, i.e. openqa-staging-1.qa.suse.de or openqa-staging-2.qa.suse.de, assuming the setup is basically the same effort and others can more easily check it and help out.

Actions #9

Updated by robert.richardson over 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (robert.richardson)

The grafana instance on staging-1 is up. For the setup I've copied the grafana database /var/lib/grafana/grafana.db as well as the auth section of /etc/grafana/grafana.ini and the LDAP setup file /etc/grafana/ldap.toml.

I've set '[unified_alerting] enabled = true' in the config file. After I restarted the grafana service, all alert rules were migrated successfully, and the respective contact points, notification policies and silences showed up in the UI.
You can check out the migrated alerts in the new layout/UI here:
http://openqa-staging-1.qa.suse.de:3000/alerting/list
(Login via NIS account)

I was also able to revert the migration without problems as described here: https://grafana.com/docs/grafana/latest/alerting/migrating-alerts/roll-back/
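
For reference, the config change from this comment can be scripted. This is only a minimal Python sketch, assuming the grafana.ini in question parses with the standard configparser module (which drops comments when writing back, so try it on a copy first). The sample text and the function name set_unified_alerting are made up for illustration. Flipping the flag back is only the config part of a revert; the roll-back page linked above remains the authoritative procedure.

```python
import configparser
import io


def set_unified_alerting(ini_text: str, enabled: bool) -> str:
    """Return ini_text with '[unified_alerting] enabled' set to true or false."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("unified_alerting"):
        parser.add_section("unified_alerting")
    parser.set("unified_alerting", "enabled", "true" if enabled else "false")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Illustrative sample only, not the real config from staging-1.
sample = "[auth.ldap]\nenabled = true\nconfig_file = /etc/grafana/ldap.toml\n"

migrated = set_unified_alerting(sample, True)
reverted = set_unified_alerting(migrated, False)
print(migrated)
```

The grafana service still needs a restart afterwards for the setting to take effect, as described above.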

Setting this ticket back to workable for now, as I'm in school next week.

Actions #10

Updated by livdywan over 1 year ago

  • Due date deleted (2023-01-31)

Due date is not generally applicable to workable

Actions #11

Updated by mkittler over 1 year ago

When creating https://build.opensuse.org/request/show/1063633 I had to go through the lengthy list of changes to format them. I noticed that there are several fixes and improvements regarding the migration of alerts to the new system. Maybe it is worthwhile to wait until the new version is deployed (my SR is currently still pending) so we can benefit from these changes.

Actions #12

Updated by nicksinger over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #13

Updated by nicksinger over 1 year ago

Let's try with our current version (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/792). In case we require grafana 9, we can roll back according to @robert.richardson's experiments.

Actions #14

Updated by openqa_review over 1 year ago

  • Due date set to 2023-02-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler over 1 year ago

It is hard to tell which alerts are now silenced because the list on https://stats.openqa-monitor.qa.suse.de/alerting/silences is difficult to read. We should look into this. Note that the disk usage alert for the storage server can be resumed.

Actions #16

Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Resolved

We had a chance to talk about the new system on Friday with Jan and Marius and concluded that the silences in https://stats.openqa-monitor.qa.suse.de/alerting/silences reflect the alerts we had disabled and the special settings we previously had for metrics. For example, https://stats.openqa-monitor.qa.suse.de/alerting/silence/af043d6e-7c17-4630-b7a0-9f618974a8e6/edit?alertmanager=grafana reflects that we disabled the "host up alert" for openqaworker-arm-1. Silences can be matched to alerts by searching the alerts by rule_uid, like this: https://stats.openqa-monitor.qa.suse.de/alerting/list?queryString=rule_uid%3Df7G8-XA4k which is a bit too much work but works for now. We also found out that a silence shows zero alerts and an empty "Affected alerts" list if no alert is currently firing. In general I think we fulfill all ACs here. For further improvements regarding the grouping/tagging I see https://progress.opensuse.org/issues/122848 as a suitable follow-up ticket. If you see other potential for improvement, I'd suggest adding it as a subtask to our epic: https://progress.opensuse.org/issues/113674
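
To save some of the manual work when looking up a rule_uid, the search link can be built with a few lines of Python. This is a minimal sketch that only reproduces the URL pattern from the example link in this comment; the function name alert_list_url is hypothetical, and whether the queryString format stays the same in other Grafana versions is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.openqa-monitor.qa.suse.de"


def alert_list_url(rule_uid: str, base_url: str = BASE_URL) -> str:
    """Build the alert list URL filtered by rule_uid, as in the link above."""
    # urlencode percent-encodes the "=" inside the filter value as %3D.
    query = urlencode({"queryString": f"rule_uid={rule_uid}"})
    return f"{base_url}/alerting/list?{query}"


print(alert_list_url("f7G8-XA4k"))
```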

Actions #17

Updated by nicksinger over 1 year ago

We also discussed the point "alerts work at least the way they did before" here: https://suse.slack.com/archives/C02AJ1E568M/p1675871242492409

Actions #18

Updated by mkittler about 1 year ago

  • Status changed from Resolved to Feedback

I'm afraid it doesn't work as before: #120267#note-28

Actions #19

Updated by okurz about 1 year ago

  • Subject changed from Migrate our Grafana setup to "unified alerting" size:M to Migrate our Grafana setup to "unified alerting"
  • Due date deleted (2023-02-23)
Actions #21

Updated by okurz about 1 year ago

  • Related to action #125303: prevent confusing "no data" alerts size:M added
Actions #22

Updated by okurz about 1 year ago

  • Priority changed from Normal to High

"High" due to #125303

Actions #23

Updated by okurz about 1 year ago

  • Due date set to 2023-03-15
Actions #24

Updated by mkittler about 1 year ago

You may resolve this ticket as the migration is done. However, please create a follow-up for #120267#note-28 then (storing alert config in JSON).

Actions #25

Updated by nicksinger about 1 year ago

  • Status changed from Feedback to Resolved
Actions #26

Updated by okurz about 1 year ago

  • Due date deleted (2023-03-15)