action #125642

openQA Project - coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:M

Manage "unified alerting" via salt size:M

Added by nicksinger over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 2023-01-09
Due date: -
% Done: 0%
Estimated time: -

Description

Summary

Since the switch to unified alerting, alerts are no longer linked to panels and are therefore no longer managed by our salt. We should make sure they become managed again somehow.

Acceptance criteria

  • AC1: Alerts are managed by salt

Suggestions


Related issues 1 (0 open, 1 closed)

Blocks openQA Infrastructure - action #125303: prevent confusing "no data" alerts size:M (Resolved, nicksinger, 2023-03-02 to 2023-04-07)

Actions #1

Updated by okurz over 1 year ago

  • Tags set to infra, salt, grafana, alerts
  • Due date deleted (2023-03-15)
  • Priority changed from Normal to High
Actions #2

Updated by mkittler over 1 year ago

  • Subject changed from Manage "unified alerting" via salt to Manage "unified alerting" via salt size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by nicksinger over 1 year ago

I found a hint in a complaint in the community: https://community.grafana.com/t/ngalert-grafana-8-alert-feature-how-to-export-import-alerts-as-yml-json/51677/26
which points to https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/ where an approach similar to the one for dashboards is described. But it apparently requires Grafana 9.1, so we need to think about how to continue here.
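
A quick way to check which Grafana version is actually running on the monitoring host (just a sketch; adjust host/authentication as needed):

# on the monitoring host itself
rpm -q grafana
# or via the HTTP API, whose health info includes the running version
curl -s https://monitor.qa.suse.de/api/health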

Actions #5

Updated by nicksinger over 1 year ago

  • Blocks action #125303: prevent confusing "no data" alerts size:M added
Actions #6

Updated by nicksinger over 1 year ago

  • Assignee set to osukup

@osukup could you please update us after Monday on whether mcaj updated the monitoring repository so the grafana 9 package can be built for Leap 15.4?

Actions #7

Updated by nicksinger over 1 year ago

  • Status changed from Workable to Feedback
Actions #8

Updated by osukup over 1 year ago

  • Status changed from Feedback to Workable
  • Assignee changed from osukup to nicksinger

After a short talk with @mcaj, the project meta was updated to build the 15.4 and 15.5 repositories against Update, so with Requires: go >= 1.19, grafana-9 should now be buildable in the project.

Actions #9

Updated by okurz over 1 year ago

  • Assignee deleted (nicksinger)

Discussed in daily 2023-03-15 and we found that https://monitor.qa.suse.de was automatically upgraded to grafana 9.3.6 and we rolled back the alerting for now, see #125303. So we can experiment with the new version, e.g. either locally in a container/VM, on an openQA staging instance like openqa-staging-1.qa.suse.de or in production

Actions #10

Updated by mkittler over 1 year ago

  • Assignee set to mkittler

Looks like we currently provision dashboards via files on disk. I suppose therefore it makes sense to do the same for alerts which means following option 1 from https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources (https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning).
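
For orientation, this is roughly how the two approaches sit next to each other on disk (a sketch only; the dashboard paths match the rsync commands further down in this ticket):

ls /etc/grafana/provisioning/dashboards/   # provider config for the dashboards provisioned from disk
ls /var/lib/grafana/dashboards/            # the provisioned dashboard JSON files themselves
ls /etc/grafana/provisioning/alerting/     # with option 1, the alert rule YAML files would live here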

Actions #11

Updated by mkittler over 1 year ago

  • Status changed from Workable to In Progress
Actions #12

Updated by mkittler over 1 year ago

The following should work according to my local testing:

  1. Enable unified alerting again
  2. Export the alerts as YAML (e.g. via the web UI)
    • This will not cover silences. I don't see a way to export them. (I've checked the documentation on it.) Likely this is ok, though.
  3. Delete all migrated "manual" alerts again; the Grafana documentation explicitly states: "If you do not delete the alert rule, it will clash with the provisioned alert rule once uploaded."
  4. Put the YAML file under /etc/grafana/provisioning/alerting/ via Salt (a rough manual equivalent is sketched after this list).
    • One needs to replace the concrete values with placeholders again and fit things into our Salt repo's structure. That will be the hard part.
  5. Restart the service.
  6. Alerts should show up again as "provisioned".
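
A rough manual equivalent of steps 4-6, i.e. what Salt would eventually do for us (the file name is just a placeholder):

sudo install -d /etc/grafana/provisioning/alerting
sudo cp exported-alerts.yaml /etc/grafana/provisioning/alerting/
sudo systemctl restart grafana-server.service
# the alerts should now carry the "provisioned" label in the web UI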

I have added all dashboards we have on the monitoring host to my local instance with these simple commands:

rsync -aHP monitor.qa.suse.de:/var/lib/grafana/dashboards/ /var/lib/grafana/dashboards/
rsync -aHP monitor.qa.suse.de:/etc/grafana/provisioning/ /etc/grafana/provisioning/
sudo systemctl restart grafana-server.service

So I have all dashboards available locally. I'm going to conduct the steps locally (after explicitly enabling legacy alerting). If I'm pulling in the latest config from production it might even make sense to conduct step 4. locally and prepare a PR for production to enable unified alerting that also immediately contains the provisioning (based on my local export).

Actions #13

Updated by okurz over 1 year ago

mkittler wrote:

...

  1. Export the alerts as YAML (e.g. via the web UI)
    • This will not cover silences. I don't see a way to export them. (I've checked the documentation on it.) Likely this is ok, though.

Yes, I think this is ok. In case we need to reinstall and don't have/use a database backup, we would just need to manually silence the exceptions; that should be ok.

If I'm pulling in the latest config from production it might even make sense to conduct step 4. locally and prepare a PR for production to enable unified alerting that also immediately contains the provisioning (based on my local export).

That would be great! In the end we want to be able to replicate the production setup, so you could also deploy on a staging instance for everyone to test that.

Actions #14

Updated by openqa_review over 1 year ago

  • Due date set to 2023-03-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by nicksinger over 1 year ago

Just a small hint: there is an API endpoint available to get the JSON representation of the alerts: https://grafana.com/docs/grafana/latest/developers/http_api/alerting_provisioning/#route-get-alert-rule-export
So you might be able to quickly fetch every rule with a simple bash/curl script and don't need to click a button for every alert in the web UI :)
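
A minimal sketch of what that could look like, assuming a service account token with sufficient permissions and that the export routes from the linked documentation are available in our Grafana version (routes and parameters should be double-checked against those docs):

# export all alert rules as YAML in one go
curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "https://monitor.qa.suse.de/api/v1/provisioning/alert-rules/export?format=yaml" > alert-rules.yaml

# or list all rules as JSON and export them one by one per UID
curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" "https://monitor.qa.suse.de/api/v1/provisioning/alert-rules" |
    jq -r '.[].uid' | while read -r uid; do
        curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" \
            "https://monitor.qa.suse.de/api/v1/provisioning/alert-rules/$uid/export?format=yaml" > "alert-rule-$uid.yaml"
    done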

Actions #16

Updated by mkittler over 1 year ago

and do not need to click a button for every alert in the webui

I know but there's one button to export them all.


Unfortunately importing dashboards from production while still having the legacy alerting enabled didn't work as well as expected. Most dashboards fail to import due to "data source not found". I actually have added InfluxDB as data source locally in the same way as we have it in production, so I'm not sure what's missing as I also couldn't find a more specific error message. I've also gotten no alerts at all. The dashboards that could be imported have the alert panel, but alerts are not showing up under the alerts page. Any ideas what I'm missing? I've also tried it twice (wiping the Grafana db). I find it very weird that some dashboards even show up. The ones that don't show up don't use different data sources.

We could also skip the local part and export the migrated alerts directly from production.

Actions #17

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

I suggest we proceed with this ticket in a pair-programming session, e.g. after the infra meeting. I suppose I could benefit a lot from doing it together with Nick or Oliver. Maybe you have better ideas for importing production data locally, or we just enable unified alerting in production again and continue with the already mentioned steps from there.

Actions #18

Updated by livdywan over 1 year ago

Apparently Robert copied the db, see https://progress.opensuse.org/issues/122845#note-9 - alternatively we seem to be fine with your doing it in production / in a pair-programming session.

Actions #19

Updated by mkittler over 1 year ago

I could also try copying the DB. I suppose I have even suggested that to him back then :-)

I just hadn't thought about it because I thought he'd eventually just moved a single dashboard over. (Can't check that right now because thincsus.qa.suse.de is down. Well, it appears to be thincsus.qe.nue2.suse.org now but its setup is quite broken.)

Actions #20

Updated by mkittler over 1 year ago

  • Status changed from Feedback to In Progress

Ok, so copying the DB and enabling LDAP on my local instance made it work. I could export everything. The file is over 30000 lines long but Kate and Vim handle it well. I couldn't find a way to delete all alerts so I've just kept all alerts there, moved the exported YAML to the provisioning directory and restarted the service. It seems that not deleting the alerts beforehand is not a big deal. In fact, I've ended up with the same number of alerts as before (no duplicates) and all have the "provisioned" label. So we can just skip the deletion part.
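
Copying the DB presumably amounts to syncing the SQLite database file in addition to the rsync commands above; a sketch, with default paths assumed (the LDAP configuration is not shown):

sudo systemctl stop grafana-server.service
rsync -aHP monitor.qa.suse.de:/var/lib/grafana/grafana.db /var/lib/grafana/grafana.db
sudo systemctl start grafana-server.service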

Maybe it still makes sense to enable unified alerting in production first and export everything from there, just in case my local version misses something. Before that, I'll prepare what the salt states change would look like.

Actions #21

Updated by mkittler over 1 year ago

Actions #22

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback
Actions #23

Updated by mkittler over 1 year ago

  • Status changed from Feedback to In Progress

We're back using unified alerting and my MR for the provisioning has been merged.

Just to recap, it had been created using the following logic to split the exported file and discard certain alerts:

# assumes Mojo::File for path(); $yaml holds the parsed export and $dump_yaml is a
# YAML-serializing code ref, both set up earlier in the script
use Mojo::File qw(path);

my %new_groups_by_dashboard;
my $groups = $yaml->{groups};
for my $group (@$groups) {
  # get rid of all alerts that haven't been provisioned so far
  next unless $group->{folder} eq 'Salt';

  # create a separate set of groups per dashboard
  die "$group->{name} has no rules\n" unless my $rules = $group->{rules};
  die "$group->{name} has not exactly one rule\n" unless @$rules == 1;
  die "$group->{name} has no dashboardUid\n" unless my $dashboard_uid = $rules->[0]->{dashboardUid};
  my $dashboard_groups = $new_groups_by_dashboard{$dashboard_uid} //= [];
  push @$dashboard_groups, $group;
}

# write one provisioning file per dashboard into the Salt repo checkout
my $output_dir = path("$ENV{OPENQA_BASEDIR}/repos/salt-states-openqa/monitoring/grafana/alerting");
$output_dir->make_path->remove_tree({keep_root => 1});
for my $key (keys %new_groups_by_dashboard) {
  my $new_yaml = {apiVersion => 1, groups => $new_groups_by_dashboard{$key}};
  my $output_file = path($output_dir, "dashboard-$key.yaml");
  $output_file->spurt($dump_yaml->($new_yaml));
}

Today I restarted Grafana and, unlike what I experienced with Grafana 9.4 locally, the conflict with existing alerts was a problem:

Mar 23 10:51:09 openqa-monitor grafana-server[7603]: Failed to start grafana. error: alert rules: a conflicting alert rule is found: rule title under the same organisation and folder should be unique
Mar 23 10:51:09 openqa-monitor grafana-server[7603]: alert rules: a conflicting alert rule is found: rule title under the same organisation and folder should be unique

So I've deleted the relevant 320 alerts manually via the web UI, restarted the service and now all alerts in the "Salt" folder show up as "provisioned".


So the first step was successful. Now we need to re-templatize the alerts:

The following files should be generated from a worker-specific template:

find -iname '*WD*' | sort
./dashboard-WDgrenache-1.yaml
./dashboard-WDmalbec.yaml
./dashboard-WDopenqaworker14.yaml
./dashboard-WDopenqaworker16.yaml
./dashboard-WDopenqaworker17.yaml
./dashboard-WDopenqaworker18.yaml
./dashboard-WDopenqaworker-arm-1.yaml
./dashboard-WDopenqaworker-arm-2.yaml
./dashboard-WDopenqaworker-arm-3.yaml
./dashboard-WDopenqaworker-arm-4.yaml
./dashboard-WDopenqaworker-arm-5.yaml
./dashboard-WDpowerqaworker-qam-1.yaml
./dashboard-WDQA-Power8-4-kvm.yaml
./dashboard-WDQA-Power8-5-kvm.yaml
./dashboard-WDworker10.yaml
./dashboard-WDworker11.yaml
./dashboard-WDworker12.yaml
./dashboard-WDworker13.yaml
./dashboard-WDworker2.yaml
./dashboard-WDworker3.yaml
./dashboard-WDworker5.yaml
./dashboard-WDworker6.yaml
./dashboard-WDworker8.yaml
./dashboard-WDworker9.yaml

The following files should be generated from a template for generic hosts:

find -iname '*GD*' | sort
./dashboard-GDbackup-vm.yaml
./dashboard-GDbaremetal-support.yaml
./dashboard-GDjenkins.yaml
./dashboard-GDopenqa-piworker.yaml
./dashboard-GDopenqaw5-xen.yaml
./dashboard-GDqamasternue.yaml
./dashboard-GDschort-server.yaml
./dashboard-GDstorage.yaml
./dashboard-GDtumblesle.yaml

The remaining files can stay as they are.

This corresponds to the split/templating we have for the legacy alert data.
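
Conceptually each per-host file is the same rule set with just the host name substituted; the actual mechanism will be the usual Salt/Jinja templating in salt-states-openqa, but as a purely illustrative sketch (template file name and placeholder are hypothetical):

# generate one alerting file per worker host from a shared template
for host in grenache-1 malbec openqaworker14; do
    sed "s/@HOSTNAME@/$host/g" worker-alerts.template.yaml > "dashboard-WD$host.yaml"
done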

Actions #24

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

MR for making generic/worker alerts templates: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/816

I haven't tested it yet but maybe we just merge it and find out how well it works. I'm also still not sure how to deal with UIDs, see https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/816#note_499823.

Looks like I'll also have to take care of alerts that originate from webui.services.json and certificates.json but that would be the next step.


I've noticed a problem with the file provisioning and unified alerting: One cannot edit alerts via the web UI at all anymore. The dashboard provisioning and legacy alerting allow editing everything in the web UI, and upon saving one gets a prompt to C&P the JSON. That's apparently not how it works for the new alerting. On the upside, we can now more easily make changes directly within the YAML, which is especially useful for the templated alerts. However, sometimes it would be nice to play around in the web UI. Now one can only play around with the query in the query explorer. I'm not sure how one would export a query from the query explorer as JSON (and not as a raw string), e.g. to update that query in one of our alert YAML files.

If we used the REST API to update alerts we could use the x-disable-provenance header. Supposedly the alerts wouldn't then be considered "Provisioned" anymore. Then the corresponding label in the web UI would disappear and the edit button reappear. I don't think there'd be a prompt containing the JSON/YAML upon saving, so we'd somehow need to export the changes back manually. Hence I'm not sure whether using the REST API would be a big win.
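
A rough sketch of what such an update could look like, assuming the alert-rule provisioning route accepts the header as described in the docs (rule UID and payload file are placeholders):

curl -sf -X PUT \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -H "X-Disable-Provenance: true" \
    -d @updated-rule.json \
    "https://monitor.qa.suse.de/api/v1/provisioning/alert-rules/<rule-uid>"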


Note that currently we need to either restart the Grafana service manually or call a special API route in order to make changes to provisioned alerts effective (see the note in the official documentation). I suppose we should automate this. However, the other problems I've mentioned are likely something to discuss first.
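
For reference, that would roughly be one of the following; the reload route is the one I recall from the provisioning documentation, so the exact path should be verified there (basic-auth admin credentials assumed):

sudo systemctl restart grafana-server.service
# or, without a restart, ask Grafana to re-read the provisioning files:
curl -sf -X POST -u "$GRAFANA_ADMIN_USER:$GRAFANA_ADMIN_PASSWORD" \
    "https://monitor.qa.suse.de/api/admin/provisioning/alerting/reload"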


Overall, this is not very nice. We will likely have to set up silences again after merging the templating MR. We will likely lose the ability to edit alerts as conveniently via the web UI as it was possible before. Unfortunately, the legacy alerting is going to be removed in Grafana 10 so there's likely no way around this (except staying on Grafana < 10 forever).

Actions #25

Updated by mkittler over 1 year ago

We've now configured alert notifications so that only Nick and I get e-mails. Let's see how well it works now. We made the following improvements:

  1. Removed all silences generated by the migration as they were not effective anymore after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/816 changed most of the UIDs. That means we can start from scratch adding only the silences we need.
  2. Following up on 1., we have added a silence for the pi-worker alerts as this host simply doesn't seem to be stable enough yet. This already shows that the new alerting lets us create silences in a much more generic way than what was generated by the migration (see the sketch after this list).
  3. We deleted all alerts the migration created for non-provisioned dashboards as we most likely don't want any alerts for these.
  4. We deleted one alert that was not working anyways (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/819) and caused no-data-notifications.
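
For the record, such a generic silence can also be created via the Grafana-managed Alertmanager API instead of the web UI; a minimal sketch, assuming a token with the relevant permission and with purely illustrative matcher labels and timestamps:

curl -sf -X POST \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"matchers": [{"name": "hostname", "value": "openqa-piworker", "isRegex": false}],
         "startsAt": "2023-03-27T00:00:00Z", "endsAt": "2023-04-27T00:00:00Z",
         "createdBy": "nicksinger", "comment": "host not stable enough yet"}' \
    "https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences"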
Actions #26

Updated by mkittler over 1 year ago

There were more mails again. I suppose we need:

  • A silence for no-data alerts of hosts that are known to be unstable and have automatic recovery actions triggered.
  • Generally lower the threshold until a no-data alert is triggered. However, maybe this is also just a symptom of putting the alerts into different folders as the error mentioned in the corresponding mails is just "Error: failed to build query 'A': database is locked". I've seen these kinds of errors before when mass-importing the newly provisioned alerts and it resolved itself after a few minutes.
Actions #27

Updated by nicksinger over 1 year ago

mkittler wrote:

There were more mails again. I suppose we need:

  • A silence for no-data alerts of hosts that are known to be unstable and have automatic recovery actions triggered.
  • Generally lower the threshold until a no-data alert is triggered. However, maybe this is also just a symptom of putting the alerts into different folders as the error mentioned in the corresponding mails is just "Error: failed to build query 'A': database is locked". I've seen these kinds of errors before when mass-importing the newly provisioned alerts and it resolved itself after a few minutes.

There seems to be an open bug in grafana regarding the locked database: https://github.com/grafana/grafana/issues/16638
I therefore tried the workaround mentioned in https://github.com/grafana/grafana/issues/16638#issuecomment-1417371248 and issued sqlite3 grafana.db "pragma journal_mode=wal;" on monitor.qa.suse.de - let's see if that helps.
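
To double-check later whether the setting stuck, the journal mode can be queried again (a sketch; the default database path is assumed):

sudo sqlite3 /var/lib/grafana/grafana.db "pragma journal_mode;"   # should print "wal"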

Actions #28

Updated by mkittler over 1 year ago

Yes, let's see whether it works. This reminds me of our Minion setup on workers where setting the DB to WAL mode made things even worse. However, normally it should be a good choice, indeed.

I've just added a silence for the first point in my last comment.

Not sure about alerts like "Firing: 3 alerts for alertname=DatasourceNoData grafana_folder=Salt" for rules like "rulename: Disk I/O time for /dev/vda (/) alert", "rulename: Job age (scheduled) (max) alert", …. Those are about the web UI host and it shouldn't be considered unstable. Maybe we need to suppress such no-data alerts after all (which would be in line with our previous setting in the old alerting)?

Actions #29

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

After rebooting openqaworker13 we have only received notifications for the "Disk I/O time alert". This is a manageable amount of mails. So although I believe that this alert does not fall under #122842, I have changed the notification channel back to normal. I will leave a message in the team chat about it. I have also just created #126962 for using templating consistently.

Actions #30

Updated by okurz over 1 year ago

  • Due date deleted (2023-03-31)