action #122845
openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids
coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M
Migrate our Grafana setup to "unified alerting"
0%
Description
Summary
With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels we cannot add an alert for the IO time with the current alerting backend we use. This should be possible with unified alerting: https://grafana.com/blog/2021/06/14/the-new-unified-alerting-system-for-grafana-everything-you-need-to-know/
Acceptance criteria
- AC1: alerts work at least the way they did before
- AC2: the Grafana instance runs stable with unified alerting
Suggestions
- Find out how to migrate to the new system, automatically or manually
- Try out with an official test instance of Grafana available from their website
- Test with a container
- Confirm what we end up with, e.g. new JSON or a different layout
- Keep in mind this is the default for Grafana 10 and our current setup may not be supportable long-term
Related issues
History
#1
Updated by cdywan 3 months ago
- Blocks action #122842: Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M added
#2
Updated by cdywan 3 months ago
- Blocks action #122848: Configure grouped alerts in Grafana correctly size:M added
#5
Updated by robert.richardson 2 months ago
- Assignee set to robert.richardson
#6
Updated by robert.richardson 2 months ago
- Status changed from Workable to In Progress
#7
Updated by openqa_review 2 months ago
- Due date set to 2023-01-31
Setting due date based on mean cycle time of SUSE QE Tools
#8
Updated by cdywan 2 months ago
We talked about it in the unblock. Robert's trying to set up Grafana locally, since we don't want to just upgrade in production. We could instead have that setup on a staging machine, i.e. openqa-staging-1.qa.suse.de or openqa-staging-2.qa.suse.de, assuming the effort is basically the same, so that others can more easily check it and help out.
#9
Updated by robert.richardson 2 months ago
- Status changed from In Progress to Workable
- Assignee deleted (robert.richardson)
The Grafana instance on staging-1 is up. For the setup I copied the Grafana database /var/lib/grafana/grafana.db as well as the auth section of /etc/grafana/grafana.ini and the LDAP setup file /etc/grafana/ldap.toml.
I've set '[unified_alerting] enabled = true' in the config file. After I restarted the grafana service, all alert rules had been migrated successfully and the respective contact points, notification policies and silences showed up in the UI.
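The config change described above can be sketched in Python. This is only an illustration on a sample string, not the way the change was actually made on staging-1: the helper name `enable_unified_alerting` and the sample content are hypothetical. Note that `configparser` drops comments and rejects keys that appear before the first section header, so this should not be run unmodified against a real /etc/grafana/grafana.ini.

```python
import configparser
import io


def enable_unified_alerting(ini_text: str) -> str:
    """Return ini_text with '[unified_alerting] enabled = true' set.

    Hypothetical helper for illustration; comments in the input are lost.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    # Create the section if the file does not have it yet.
    if not parser.has_section("unified_alerting"):
        parser.add_section("unified_alerting")
    parser.set("unified_alerting", "enabled", "true")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Sample input standing in for the copied auth section (assumed content).
sample = "[auth.ldap]\nenabled = true\nconfig_file = /etc/grafana/ldap.toml\n"
updated = enable_unified_alerting(sample)
print(updated)
```

The grafana service still has to be restarted afterwards for the migration to run, as described above.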
You can check out the migrated alerts in the new layout/UI here:
http://openqa-staging-1.qa.suse.de:3000/alerting/list
(Login via NIS account)
I was also able to revert the migration without problems as described here: https://grafana.com/docs/grafana/latest/alerting/migrating-alerts/roll-back/
Setting this ticket back to Workable for now, as I'm in school next week.
#10
Updated by cdywan about 2 months ago
- Due date deleted (2023-01-31)
Due date is not generally applicable to workable
#11
Updated by mkittler about 2 months ago
When creating https://build.opensuse.org/request/show/1063633 I had to go through the lengthy list of changes to format them. I've noticed that there are several fixes and improvements regarding the migration of alerts to the new system. Maybe it is worthwhile to wait until the new version is deployed (currently my SR is even still pending) so we can benefit from these changes.
#12
Updated by nicksinger about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
#13
Updated by nicksinger about 2 months ago
Let's try with our current version (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/792). In case we require Grafana 9 we can roll back according to robert.richardson's experiments.
#14
Updated by openqa_review about 2 months ago
- Due date set to 2023-02-23
Setting due date based on mean cycle time of SUSE QE Tools
#15
Updated by mkittler about 1 month ago
It is hard to tell which alerts are currently silenced because the list on https://stats.openqa-monitor.qa.suse.de/alerting/silences is hard to make sense of. We should look into this. Note that the disk usage alert for the storage server can be resumed.
#16
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Resolved
We had a chance to talk about the new system on Friday with Jan and Marius and came to the conclusion that the silences in https://stats.openqa-monitor.qa.suse.de/alerting/silences reflect our disabled alerts and the special settings we previously had for metrics. E.g. https://stats.openqa-monitor.qa.suse.de/alerting/silence/af043d6e-7c17-4630-b7a0-9f618974a8e6/edit?alertmanager=grafana reflects that we disabled the "host up alert" for openqaworker-arm-1.
Silences can be matched to alerts by searching the alerts by rule_uid like this: https://stats.openqa-monitor.qa.suse.de/alerting/list?queryString=rule_uid%3Df7G8-XA4k which is a bit too much work but works for now. Another thing we found out is that a silence shows zero alerts and an empty "Affected alerts" list if none of its alerts is currently firing.
In general I think we fulfill all ACs here. For further improvements regarding the grouping/tagging I see https://progress.opensuse.org/issues/122848 as a suitable follow-up ticket. If you see other potential for improvement I'd suggest adding it as a subtask to our epic: https://progress.opensuse.org/issues/113674
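To make the rule_uid lookup above a bit less manual, the search URL can be built programmatically. This is a minimal sketch: the function name `alert_list_url` is hypothetical, and the base URL and the `queryString=rule_uid=...` format are taken from the example link above rather than from any documented Grafana API.

```python
from urllib.parse import urlencode

# Alert list page of our instance, as used in the link above.
BASE = "https://stats.openqa-monitor.qa.suse.de/alerting/list"


def alert_list_url(rule_uid: str, base: str = BASE) -> str:
    """Build the alert list URL filtered by rule_uid (hypothetical helper)."""
    # urlencode percent-encodes the '=' inside the value as %3D.
    query = urlencode({"queryString": f"rule_uid={rule_uid}"})
    return f"{base}?{query}"


url = alert_list_url("f7G8-XA4k")
print(url)
```

The rule_uid itself still has to be read from the matchers of the silence in question.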
#17
Updated by nicksinger about 1 month ago
We also discussed the point "alerts work at least the way they did before" here: https://suse.slack.com/archives/C02AJ1E568M/p1675871242492409
#20
Updated by okurz 25 days ago
Then I suggest to look into https://www.ecosia.org/search?q=grafana%20unified%20alerting%20deployed%20from%20files or similar
#21
Updated by okurz 20 days ago
- Related to action #125303: prevent confusing "no data" alerts size:M added
#25
Updated by nicksinger 18 days ago
- Status changed from Feedback to Resolved