action #154627: [potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

closed

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done:
Estimated time:
Tags: alert, monitoring, infra, reactive work

Description

Motivation

See #150983 and #151588. Our "host up" alert is likely showing "no data" for salt-controlled hosts that are temporarily down, but this needs to be cross-checked.

Acceptance criteria

AC1: We are alerted if a host that is currently in salt is down
AC2: There is only one firing alert at a time when a host that is currently in salt is down
AC3: There is no firing alert after a reasonable time once we have removed a host from salt control, i.e. removed it from the salt keys on OSD and potentially re-deployed a high state

Suggestions

On monitor.qa.suse.de select any host, show the "host up" panel, then shut down the machine and check how the ping behaves, e.g. select tumblesle on qamaster
Fix the alert or, if everything works fine, convince everybody that we made a big noise about nothing
Extend our documentation in the salt-states repo, team wiki or openQA wiki, as applicable, on how to handle taking hosts down or bringing them back up, e.g. review https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production

Related issues 1 (0 open — 1 closed)

Updated by okurz about 1 year ago

Copied from action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added

Updated by nicksinger about 1 year ago

Assignee set to nicksinger

Updated by nicksinger about 1 year ago

Status changed from Workable to In Progress

I dug through our current process for deploying provisioned alerts and started by trying to understand why we still have provisioned alerts for e.g. ada. It turns out that we again have stale provisioned alerts, which is not optimal. The files, however, are no longer present, so this part of our salt deployment apparently works. While reading online about provisioning I stumbled over some API endpoints that list all provisioned alerts; they can be accessed via curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password". As expected, I found some old alerts for ada in there.

Further searching revealed a special mechanism in grafana to delete such provisioned alerts explicitly: create a special YAML file in the same folder that is used to provision alerts, which looks like this:

deleteRules:
  - orgId: 1
    uid: my_id_1

After reloading grafana the stale alert is indeed gone (at least from the web UI and API; I hope it doesn't trigger mails). Unfortunately grafana does not clean up or track this file in any way, so if we introduced an alert with the same uid in the future, it would be created and instantly deleted again on startup. To solve this, I came up with the following procedure to clean up our alerts automatically:

list all alerts with UIDs (curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password" | jq -r '.[].uid')
grep for UID in the existing provisioned alert files (all files present on monitor.qa.suse.de in /etc/grafana/provisioning/alerting)
if UID is not present, add it to that special "deletion file"
reload grafana to get rid of these stale entries
remove the special deletion file again to avoid deletion if the same UID gets added later
reload grafana once again
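
The comparison and deletion-file steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `find_stale_uids` and `render_delete_file` are hypothetical, the rule list is assumed to be the parsed JSON from the alert-rules endpoint (objects with a `uid` field, as the jq filter above implies), the provisioned files are assumed to be already read into strings, and the UID `host_up_alert_ada` is a made-up example.

```python
def find_stale_uids(api_rules, provisioned_texts):
    """Return UIDs known to Grafana that appear in no provisioned file."""
    stale = []
    for rule in api_rules:
        uid = rule["uid"]
        # Plain substring search, mirroring the "grep for UID" step.
        if not any(uid in text for text in provisioned_texts):
            stale.append(uid)
    return stale


def render_delete_file(stale_uids, org_id=1):
    """Render the temporary deletion file in the deleteRules format."""
    lines = ["deleteRules:"]
    for uid in stale_uids:
        lines.append(f"  - orgId: {org_id}")
        lines.append(f"    uid: {uid}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from the API and
    # from the files in /etc/grafana/provisioning/alerting.
    rules = [{"uid": "host_up_alert_ada"}, {"uid": "host_up_alert_tumblesle"}]
    files = ["... uid: host_up_alert_tumblesle ..."]
    print(render_delete_file(find_stale_uids(rules, files)), end="")
```

Like grep, the substring match would treat a UID as still provisioned if it is a prefix of another UID, so a stricter match may be needed. A caller should also skip writing the file when no stale UIDs are found.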

Currently we restart the whole grafana service to reprovision our dashboards and alerts. This is not the cleanest solution as it makes everything unavailable and might reset alert timers and such, especially if it has to be done twice in rapid succession. I looked into API calls to reload the provisioned alerts and found a corresponding endpoint. Unfortunately it doesn't seem to be usable with "service accounts" (the replacement for API keys since Grafana version 9) and always yielded "missing permission". Searching further, I found a big lack of documentation and only a lot of hints pointing to "Role-based access control", which apparently is somewhat implemented but only fully usable in the enterprise and cloud versions of grafana.

My last resort was checking out the "admin" user on our grafana instance. I had no password for it and its email address was invalid, so I went into the sqlite database, replaced the address with "osd-admins@suse.de" and requested a password reset. With that user I was actually able to elevate more users on our instance to "instance admins", which apparently gives even more permissions (e.g. creating new organizations). With that permission it is also possible to reload the provisioned alerts by using the following curl call:

curl -X POST -H "Content-Type: application/json" -s 'https://stats.openqa-monitor.qa.suse.de/api/admin/provisioning/alerting/reload' -u "instance_admin:password"
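
For automation, the same call could be expressed in Python with only the standard library. This is a minimal sketch, not the deployed implementation: the helper name `build_reload_request` is hypothetical, and the credentials are the same placeholders as in the curl call above. The function only builds the request, so it can be inspected without contacting the server.

```python
import base64
import urllib.request

RELOAD_PATH = "/api/admin/provisioning/alerting/reload"


def build_reload_request(base_url, user, password):
    """Build (but do not send) the POST request that reloads alert provisioning."""
    # HTTP basic auth, equivalent to curl's -u "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + RELOAD_PATH,
        data=b"",
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_reload_request(
        "https://stats.openqa-monitor.qa.suse.de", "instance_admin", "password"
    )
    print(req.get_method(), req.full_url)
    # Actually sending it would be: urllib.request.urlopen(req)
```

Separating request construction from sending keeps the credentials handling testable and lets the reload be triggered twice (before and after removing the deletion file) with the same object.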

I now have all the building blocks in place to automate this with our salt deployment which I will try to realize next.

Updated by openqa_review about 1 year ago

Due date set to 2024-02-18

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz about 1 year ago

Description updated (diff)

Updated by okurz about 1 year ago

Adapted ACs as discussed. nsinger powered down tumblesle and monitoring turned to "pending", so that covers AC1. Please put #154627-3 into a script that can be executed and reference it in https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production

Updated by nicksinger about 1 year ago

Status changed from In Progress to Feedback

Script added with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1107 and wiki adjusted to reference it: https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&commit=View+differences&version=290&version_from=289 - this should cover AC3 if the MR gets merged.

AC2 is fulfilled as well because for tumblesle we only got a single mail "[FIRING:1] host_up (tumblesle: host up alert Generic tumblesle host_up_alert_tumblesle generic)" which is what we expect and have configured. I started the VM once again which resolved the alert.

Updated by okurz about 1 year ago

Due date deleted (2024-02-18)
Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1107 merged. Very nice!
