I dug through our current process and deployment of provisioned alerts, starting with the question why we still have provisioned alerts for e.g. ada. It turns out that we again have stale provisioned alerts, which is not optimal. The corresponding files, however, are no longer present, so this part of our salt deployment apparently works. While reading up on the whole provisioning situation I stumbled over some API endpoints. These can be used to list all provisioned alerts and can be accessed via
curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password"
As expected I found some old alerts for ada in there.
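To pick out the interesting fields from that endpoint's JSON response, a small jq filter is enough. Here is a sketch against a simplified, made-up payload (the real one has many more fields); the same filter works when piping the live curl output into jq:

```shell
# Simplified, hypothetical sample of what the provisioning API returns
payload='[
  {"uid": "ada_ping", "title": "ada: host up"},
  {"uid": "osd_cpu",  "title": "osd: CPU load"}
]'

# Live version would be:
# curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' \
#   -u "username:password" | jq -r '.[] | "\(.uid)\t\(.title)"'
echo "$payload" | jq -r '.[] | "\(.uid)\t\(.title)"'
```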
Further searching revealed a special mechanism in grafana to delete such provisioned alerts explicitly by creating a special YAML file in the same folder used to provision alerts, which looks like this:
deleteRules:
  - orgId: 1
    uid: my_id_1
After reloading grafana the stale alert is indeed gone (at least from the web UI and the API; I hope it doesn't trigger mails). Unfortunately grafana does not clean up or track this file in any way. So if we introduced an alert with the same UID in the future, grafana would create it and instantly delete it again on startup. To solve this, I came up with the following procedure to clean up our alerts automatically:
- list all alerts with UIDs: curl -s 'https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules' -u "username:password" | jq -r '.[].uid'
- grep for each UID in the existing provisioned alert files (all files present on monitor.qa.suse.de in /etc/grafana/provisioning/alerting)
- if a UID is not present, add it to that special "deletion file"
- reload grafana to get rid of these stale entries
- remove the special deletion file again to avoid deleting an alert if the same UID is added later
- reload grafana once again
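The core of the procedure above (comparing API UIDs against the provisioning files and generating the deletion file) could be sketched roughly like this. All paths, file names and UIDs here are illustrative stand-ins; in practice the UID list would come from the provisioning API and the directory would be /etc/grafana/provisioning/alerting:

```shell
#!/bin/sh
# Sketch: emit a deletion file for alert UIDs that grafana still knows
# about but that no provisioning file references any longer.
set -e

provisioning_dir=$(mktemp -d)   # stands in for /etc/grafana/provisioning/alerting
deletion_file="$provisioning_dir/delete-stale-alerts.yaml"

# Simulated provisioning file containing one still-valid UID
cat > "$provisioning_dir/alerts.yaml" <<'EOF'
groups:
  - rules:
      - uid: still_provisioned
EOF

# UIDs as reported by the API; "stale_alert" has no file backing it any more.
# Live version: curl -s .../api/v1/provisioning/alert-rules -u "user:pass" | jq -r '.[].uid'
api_uids="still_provisioned
stale_alert"

# Collect UIDs that no provisioning file mentions
stale=""
for uid in $api_uids; do
    grep -qr "uid: $uid" "$provisioning_dir" || stale="$stale $uid"
done

# Write the special deletion file for grafana to pick up on reload
printf 'deleteRules:\n' > "$deletion_file"
for uid in $stale; do
    printf '  - orgId: 1\n    uid: %s\n' "$uid" >> "$deletion_file"
done
cat "$deletion_file"
```

After reloading grafana the script would remove the deletion file again and trigger a second reload, as described above.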
Currently we restart the whole grafana service to reprovision our dashboards and alerts. This is not the cleanest solution as it makes everything unavailable and might reset alert timers and the like, especially if it has to be done twice in rapid succession. I looked into API calls to reload the provisioned alerts and found a corresponding endpoint. Unfortunately it doesn't seem to be usable with "service accounts" (the API-key replacement since Grafana version 9) and always yielded "missing permission". Searching further I found a big lack of documentation and only a lot of hints towards "Role-based access control", which apparently is partially implemented but only fully usable in the enterprise and cloud versions of grafana.
My last resort was checking out the "admin" user on our grafana instance. I had no password and the email address for it was invalid, so I went into the SQLite database, replaced the address with "osd-admins@suse.de" and requested a password reset. With that user I was actually able to elevate more users on our instance to "instance admins", which apparently grants even more permissions (e.g. creating new organizations). With that permission it is also possible to reload the provisioned alerts using the following curl call:
curl -X POST -H "Content-Type: application/json" -s 'https://stats.openqa-monitor.qa.suse.de/api/admin/provisioning/alerting/reload' -u "instance_admin:password"
I now have all the building blocks in place to automate this with our salt deployment, which I will try to realize next.