action #122845
closed
openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids
coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M
Migrate our Grafana setup to "unified alerting"
Description
Summary
With #112733 we got new I/O panels for the webui. Due to the nature of repeating panels we cannot add an alert for the IO time with the current alerting backend we use. This should be possible with unified alerting: https://grafana.com/blog/2021/06/14/the-new-unified-alerting-system-for-grafana-everything-you-need-to-know/
Acceptance criteria
- AC1: alerts work at least the way they did before
- AC2: the Grafana instance runs stably with unified alerting
Suggestions
- Find out how to migrate to the new system, automatically or manually
- Try out with an official test instance of Grafana available from their website
- Test with a container
- Confirm what we end up with, e.g. new JSON or a different layout
- Keep in mind this is the default for Grafana 10 and our current setup may not be supportable long-term
Updated by livdywan almost 2 years ago
- Blocks action #122842: Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M added
Updated by livdywan almost 2 years ago
- Blocks action #122848: Configure grouped alerts in Grafana correctly size:M added
Updated by robert.richardson over 1 year ago
- Status changed from Workable to In Progress
Updated by openqa_review over 1 year ago
- Due date set to 2023-01-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
We talked about it in the unblock. Robert is trying to set up Grafana locally, since we don't want to just upgrade in production. We could have that setup on a staging machine, i.e. openqa-staging-1.qa.suse.de or openqa-staging-2.qa.suse.de, assuming the setup is basically the same effort and others can more easily check it and help out.
Updated by robert.richardson over 1 year ago
- Status changed from In Progress to Workable
- Assignee deleted (robert.richardson)
The Grafana instance on staging-1 is up. For the setup I've copied the Grafana database /var/lib/grafana/grafana.db as well as the auth section of /etc/grafana/grafana.ini and the LDAP setup file /etc/grafana/ldap.toml.
I've set '[unified_alerting] enabled = true' in the config file. All alert rules were migrated successfully, and the respective contact points, notification policies and silences showed up in the UI after I restarted the Grafana service.
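For anyone repeating this on another test instance, the config change could be scripted roughly as follows. This is only a sketch under assumptions: the function name enable_unified_alerting and the sample ini text are made up for illustration, Python's configparser drops comments when writing, and it fails on files that have options before the first section header, so it should be run against a copy and not the production /etc/grafana/grafana.ini.

```python
import configparser
import io


def enable_unified_alerting(ini_text: str) -> str:
    """Return ini_text with '[unified_alerting] enabled = true' set.

    Hypothetical helper for illustration only; comments in the input
    are not preserved by configparser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("unified_alerting"):
        parser.add_section("unified_alerting")
    parser.set("unified_alerting", "enabled", "true")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Illustrative sample, not the real staging config.
sample = """\
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
"""

print(enable_unified_alerting(sample))
```

The Grafana service still needs a restart afterwards for the migration to run, as described above.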
You can check out the migrated alerts in the new layout/UI here:
http://openqa-staging-1.qa.suse.de:3000/alerting/list
(Login via NIS account)
I was also able to revert the migration without problems as described here: https://grafana.com/docs/grafana/latest/alerting/migrating-alerts/roll-back/
Setting this ticket back to workable for now, as I'm in school next week.
Updated by livdywan over 1 year ago
- Due date deleted (2023-01-31)
Due date is not generally applicable to workable
Updated by mkittler over 1 year ago
When creating https://build.opensuse.org/request/show/1063633 I had to go through the lengthy list of changes to format them. I've noticed that there are several fixes and improvements regarding the migration of alerts to the new system. Maybe it is worthwhile to wait until the new version is deployed (currently my SR is even still pending) so we can benefit from these changes.
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger over 1 year ago
Let's try with our current version (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/792). In case we require Grafana 9, we can roll back according to @robert.richardson's experiments.
Updated by openqa_review over 1 year ago
- Due date set to 2023-02-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
It is hard to tell which alerts are now silenced or not because the list on https://stats.openqa-monitor.qa.suse.de/alerting/silences is hard to make sense of. We should look into this. Note that the disk usage alert for the storage server can be resumed.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Resolved
We had a chance to talk about the new system on Friday with Jan and Marius and came to the conclusion that the silences in https://stats.openqa-monitor.qa.suse.de/alerting/silences reflect our disabled alerts and the special settings we previously had for metrics. E.g. https://stats.openqa-monitor.qa.suse.de/alerting/silence/af043d6e-7c17-4630-b7a0-9f618974a8e6/edit?alertmanager=grafana reflects that we disabled the "host up alert" for openqaworker-arm-1. Silences can be matched to alerts by searching the alerts by rule_uid like this: https://stats.openqa-monitor.qa.suse.de/alerting/list?queryString=rule_uid%3Df7G8-XA4k which is a bit too much work but works for now. Another thing we found out is that a silence shows zero alerts and an empty "Affected alerts" list if no matching alert is currently firing. In general I think we fulfill all ACs here. For further improvements regarding the grouping/tagging I see https://progress.opensuse.org/issues/122848 as a suitable follow-up ticket. If you see other potential for improvement, I'd suggest adding it as a subtask to our epic: https://progress.opensuse.org/issues/113674
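To make the rule_uid lookup a bit less tedious, the search URL can be built with a few lines of Python. This is a minimal sketch: the helper name alert_list_url is made up for illustration, and it only reproduces the URL pattern quoted above, it does not talk to Grafana.

```python
from urllib.parse import quote, urlencode

# Base URL of the monitoring instance as used in the links above.
BASE_URL = "https://stats.openqa-monitor.qa.suse.de"


def alert_list_url(rule_uid: str, base_url: str = BASE_URL) -> str:
    """Build the alert list URL filtered by a rule_uid (hypothetical helper)."""
    # The whole "rule_uid=<uid>" filter is the value of queryString,
    # so its "=" has to be percent-encoded as %3D.
    query = urlencode({"queryString": f"rule_uid={rule_uid}"}, quote_via=quote)
    return f"{base_url}/alerting/list?{query}"


print(alert_list_url("f7G8-XA4k"))
```

Where the rule_uid of a given silence comes from is not covered here; it has to be read from the silence's matchers in the UI as before.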
Updated by nicksinger over 1 year ago
We also discussed the point "alerts work at least the way they did before" here: https://suse.slack.com/archives/C02AJ1E568M/p1675871242492409
Updated by mkittler over 1 year ago
- Status changed from Resolved to Feedback
I'm afraid it doesn't work as before: #120267#note-28
Updated by okurz over 1 year ago
- Subject changed from Migrate our Grafana setup to "unified alerting" size:M to Migrate our Grafana setup to "unified alerting"
- Due date deleted (2023-02-23)
Updated by okurz over 1 year ago
Then I suggest looking into https://www.ecosia.org/search?q=grafana%20unified%20alerting%20deployed%20from%20files or similar
Updated by okurz over 1 year ago
- Related to action #125303: prevent confusing "no data" alerts size:M added
Updated by mkittler over 1 year ago
You may resolve this ticket as the migration is done. However, please create a follow-up for #120267#note-28 then (storing alert config in JSON).
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved