action #154627

closed

[potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #150983 and #151588. Currently our "host up" alert is likely showing "no data" for salt-controlled hosts that are temporarily down, but that needs to be cross-checked.

Acceptance criteria

  • AC1: We are alerted if a host that is currently in salt is down
  • AC2: There is only one firing alert at a time when a host that is currently in salt is down
  • AC3: There is no firing alert after a reasonable time if we have removed a host from salt control, i.e. removed it from the salt keys on OSD and potentially re-deployed a high state

Suggestions

  • On monitor.qa.suse.de select any host, show the "host up" panel, then shut down the machine and check how the ping behaves, e.g. select tumblesle on qamaster
  • Fix the alert or, if everything works fine, convince everybody that we made a big noise about nothing
  • Extend our documentation in the salt-states repo, team wiki or openQA wiki, as applicable, on how to handle taking hosts down/up, e.g. review https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected, okurz, 2023-11-28)

Actions #1

Updated by okurz 3 months ago

  • Copied from action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added
Actions #2

Updated by nicksinger 3 months ago

  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 3 months ago

  • Status changed from Workable to In Progress

I dug through our current process and deployment of provisioned alerts and started by understanding why we still have provisioned alerts for e.g. ada. It turns out that we again have stale provisioned alerts, which is not optimal. The files, however, are no longer present, so this part of our salt deployment apparently works. While reading online about the whole provisioning situation I stumbled over some API endpoints. These can be used to list all provisioned alerts and can be accessed via curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password". As expected, I found some old alerts for ada in there.

Further searching revealed a special mechanism in grafana to delete such provisioned alerts explicitly: create a special YAML file in the same folder that is used to provision alerts. It looks like this:

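apiVersion: 1  # assumed: config file version expected by grafana provisioning files
deleteRules:
  - orgId: 1
    uid: my_id_1  # UID of the stale alert rule to delete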

After reloading grafana the stale alert is indeed gone (at least from the web UI and API; I hope it doesn't trigger mails). Unfortunately grafana does not clean up or track this file in any way, so if we introduce an alert with the same UID in the future, grafana would create it and instantly delete it again on startup. To solve this, I came up with the following procedure to clean up our alerts automatically:

  1. list all alerts with UIDs (curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password" | jq -r '.[].uid')
  2. grep for UID in the existing provisioned alert files (all files present on monitor.qa.suse.de in /etc/grafana/provisioning/alerting)
  3. if UID is not present, add it to that special "deletion file"
  4. reload grafana to get rid of these stale entries
  5. remove the special deletion file again to avoid deletion if the same UID gets added later
  6. reload grafana once again
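
Steps 2 and 3 of this procedure could be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the script that was later added to salt-states: the function names find_stale_uids and render_delete_file are hypothetical, the UIDs from step 1 are assumed to arrive one per line on standard input, every provisioned file is assumed to contain the UID of its rule as a literal string, and the apiVersion line is assumed from grafana's general provisioning file format.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not taken from salt-states-openqa.

# Print every UID read from stdin that occurs in no file below directory $1.
find_stale_uids() {
    local dir="$1" uid
    # Refuse to run on a missing directory, otherwise every UID would look stale.
    [ -d "$dir" ] || return 1
    while IFS= read -r uid; do
        [ -n "$uid" ] || continue
        # -r: recurse, -q: quiet, -F: fixed string, -w: whole-word match
        grep -rqFw -e "$uid" "$dir" || printf '%s\n' "$uid"
    done
}

# Turn UIDs read from stdin into the content of a "deleteRules" file.
# Assumes orgId 1, as in the snippet above; apiVersion is an assumption.
render_delete_file() {
    local uid
    printf 'apiVersion: 1\ndeleteRules:\n'
    while IFS= read -r uid; do
        printf '  - orgId: 1\n    uid: %s\n' "$uid"
    done
}
```

The output of the curl and jq call from step 1 could be piped into find_stale_uids /etc/grafana/provisioning/alerting and from there into render_delete_file. The result should only be written into the provisioning folder if at least one stale UID was found, and removed again as described in step 5.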

Currently we restart the whole grafana service to reprovision our dashboards and alerts. This is not the cleanest solution as it makes everything unavailable and might reset alert timers and the like, especially if it has to be done twice in rapid succession. I looked into API calls to reload the provisioned alerts and found a corresponding endpoint. Unfortunately it doesn't seem to be usable with "service accounts" (the replacement for API keys since Grafana version 9) and always yielded "missing permission". I searched further and found a big lack of documentation and only a lot of hints pointing to "Role-based access control", which apparently is somewhat implemented but only fully usable in the enterprise and cloud versions of grafana.

My last resort was checking out the "admin" user on our grafana instance. I had no password for it and its email address was invalid, so I went into the sqlite database, replaced the address with "osd-admins@suse.de" and requested a password reset. With that user I was actually able to elevate more users on our instance to "instance admins", which apparently gives even more permissions (e.g. creating new organizations). With that permission it is also possible to reload the provisioned alerts by using the following curl call:

curl -X POST -H "Content-Type: application/json" -s 'https://stats.openqa-monitor.qa.suse.de/api/admin/provisioning/alerting/reload' -u "instance_admin:password"
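
When this call is automated, it probably makes sense to fail loudly in the "missing permission" case described above instead of silently continuing. A minimal sketch, assuming the endpoint answers with HTTP status 200 on success; the function name reload_alerting and its two arguments are hypothetical:

```shell
#!/bin/bash
# Sketch only: reload_alerting is a hypothetical helper name.
# $1: grafana base URL, $2: "user:password" of an instance admin
reload_alerting() {
    local base_url="$1" credentials="$2" status
    # -o /dev/null discards the body, -w prints only the HTTP status code
    status=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
        -H "Content-Type: application/json" -u "$credentials" \
        "$base_url/api/admin/provisioning/alerting/reload")
    # Assumption: anything other than 200 means the reload did not happen.
    if [ "$status" != "200" ]; then
        echo "reloading provisioned alerts failed with HTTP status $status" >&2
        return 1
    fi
}
```

A caller would invoke it as reload_alerting 'https://stats.openqa-monitor.qa.suse.de' "instance_admin:password" and abort the cleanup if it returns a non-zero exit code.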

I now have all the building blocks in place to automate this with our salt deployment which I will try to realize next.

Actions #4

Updated by openqa_review 3 months ago

  • Due date set to 2024-02-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 3 months ago

Adapted ACs as discussed. nsinger powered down tumblesle and monitoring turned to "pending", so that covers AC1. Please put #154627-3 into a script that can be executed and reference it in https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production

Actions #7

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Feedback

Script added with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1107 and wiki adjusted to reference it: https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&commit=View+differences&version=290&version_from=289 - this should cover AC3 if the MR gets merged.

AC2 is fulfilled as well because for tumblesle we only got a single mail "[FIRING:1] host_up (tumblesle: host up alert Generic tumblesle host_up_alert_tumblesle generic)" which is what we expect and have configured. I started the VM once again which resolved the alert.

Actions #8

Updated by okurz 3 months ago

  • Due date deleted (2024-02-18)
  • Status changed from Feedback to Resolved