I dug through our current process and deployment of provisioned alerts, starting with the question why we still have provisioned alerts for e.g. ada. It turns out that we again have stale provisioned alerts, which is not optimal. The corresponding files, however, are no longer present, so this part of our salt deployment apparently works. While reading up on the whole provisioning situation I stumbled over some API endpoints. These can be used to list all provisioned alerts and can be accessed via
curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password"
As expected I found some old alerts for ada in there.
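To pick out the interesting fields from that endpoint's JSON response, a small jq filter is enough. Here is a sketch against a simplified, made-up payload (the real one has many more fields); the same filter works when piping the live curl output into jq:

```shell
# Simplified, hypothetical sample of what the provisioning API returns
payload='[
  {"uid": "ada_ping", "title": "ada: host up"},
  {"uid": "osd_cpu",  "title": "osd: CPU load"}
]'

# Live version would be:
# curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' \
#   -u "username:password" | jq -r '.[] | "\(.uid)\t\(.title)"'
echo "$payload" | jq -r '.[] | "\(.uid)\t\(.title)"'
```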
Further searching revealed a special mechanism in grafana to delete such provisioned alerts explicitly by creating a special YAML file in the same folder used to provision alerts, which looks like this:
deleteRules:
  - orgId: 1
    uid: my_id_1
After reloading grafana the stale alert is indeed gone (at least from the web UI and the API; I hope it doesn't trigger mails). Unfortunately grafana does not clean up or track this file in any way. So if we introduced an alert with the same UID in the future, grafana would create it and instantly delete it again on startup. To solve this, I came up with the following procedure to clean up our alerts automatically:
- list all alerts with UIDs: curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password" | jq -r '.[].uid'
- grep for each UID in the existing provisioned alert files (all files present on monitor.qa.suse.de in /etc/grafana/provisioning/alerting)
- if a UID is not present, add it to that special "deletion file"
- reload grafana to get rid of these stale entries
- remove the special deletion file again to avoid deleting an alert if the same UID is added later
- reload grafana once again
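The core of the procedure above (comparing API UIDs against the provisioning files and generating the deletion file) could be sketched roughly like this. All paths, file names and UIDs here are illustrative stand-ins; in practice the UID list would come from the provisioning API and the directory would be /etc/grafana/provisioning/alerting:

```shell
#!/bin/sh
# Sketch: emit a deletion file for alert UIDs that grafana still knows
# about but that no provisioning file references any longer.
set -e

provisioning_dir=$(mktemp -d)   # stands in for /etc/grafana/provisioning/alerting
deletion_file="$provisioning_dir/delete-stale-alerts.yaml"

# Simulated provisioning file containing one still-valid UID
cat > "$provisioning_dir/alerts.yaml" <<'EOF'
groups:
  - rules:
      - uid: still_provisioned
EOF

# UIDs as reported by the API; "stale_alert" has no file backing it any more.
# Live version: curl -s .../api/v1/provisioning/alert-rules -u "user:pass" | jq -r '.[].uid'
api_uids="still_provisioned
stale_alert"

# Collect UIDs that no provisioning file mentions
stale=""
for uid in $api_uids; do
    grep -qr "uid: $uid" "$provisioning_dir" || stale="$stale $uid"
done

# Write the special deletion file for grafana to pick up on reload
printf 'deleteRules:\n' > "$deletion_file"
for uid in $stale; do
    printf '  - orgId: 1\n    uid: %s\n' "$uid" >> "$deletion_file"
done
cat "$deletion_file"
```

After reloading grafana the script would remove the deletion file again and trigger a second reload, as described above.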
Currently we restart the whole grafana service to reprovision our dashboards and alerts. This is not the cleanest solution as it makes everything unavailable and might reset alert timers and the like, especially if it has to be done twice in rapid succession. I looked into API calls to reload the provisioned alerts and found a corresponding endpoint. Unfortunately it doesn't seem to be usable with "service accounts" (the API-key replacement since Grafana version 9) and always yielded "missing permission". Searching further I found a big lack of documentation and only a lot of hints towards "Role-based access control", which apparently is partially implemented but only fully usable in the enterprise and cloud versions of grafana.
My last resort was checking out the "admin" user on our grafana instance. I had no password and the email address for it was invalid, so I went into the SQLite database, replaced the address with "osd-admins@suse.de" and requested a password reset. With that user I was actually able to elevate more users on our instance to "instance admins", which apparently grants even more permissions (e.g. creating new organizations). With that permission it is also possible to reload the provisioned alerts using the following curl call:
curl -X POST -H "Content-Type: application/json" -s 'https://stats.openqa-monitor.qa.suse.de/api/admin/provisioning/alerting/reload' -u "instance_admin:password"
I now have all the building blocks in place to automate this with our salt deployment, which I will try to realize next.