action #138005

closed

grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts size:M

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: -
Start date: 2023-10-14
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 should show the ping results regarding packet loss to "other hosts", i.e. hosts that are not under our salt control, like dist.suse.de. However, the panel shows a very long and slowly rendering list including multiple redundant entries that are already covered by other panels, e.g. "worker40 - openqa.suse.de" and also "openqa - tumblesle.qe.nue2.suse.org", all of which are salt controlled.

Acceptance criteria

  • AC1: The panel neither shows nor alerts on packet loss to salt-controlled machines
  • AC2: The panel still shows and alerts on packet loss to any "other host"

Suggestions
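
  • One possible direction, as a rough sketch (the database name "telegraf", the measurement "ping", the field "percent_packet_loss" and a tag named "external" are assumptions, not decided names): mark pings to non-salt hosts with a dedicated telegraf tag and restrict the panel/alert query to that tag, e.g.

influx -database telegraf -execute \
  "SELECT mean(\"percent_packet_loss\") FROM \"ping\" WHERE \"external\" = 'true' AND time > now() - 1h GROUP BY time(5m), \"url\""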

Further details


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #137600: [alert] Packet loss between worker hosts and other hosts size:S - Resolved - okurz - 2023-10-09

Actions #2

Updated by livdywan about 1 year ago

  • Related to action #137600: [alert] Packet loss between worker hosts and other hosts size:S added
Actions #3

Updated by okurz about 1 year ago

  • Subject changed from grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts to grafana panel "Packet loss between worker hosts and other hosts" shows more than just ping to "other hosts" and hence becomes slow and triggers redundant alerts size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by nicksinger about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #5

Updated by nicksinger about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1026 introduces a tag for every external host ping, which we can filter on in the panel/alert views.
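
The MR content is not quoted here; conceptually the change boils down to something like the following telegraf snippet (tag name, file path and target list are assumptions, not the actual MR content):

cat > /etc/telegraf/telegraf.d/ping_external.conf <<'EOF'
# ping targets that are not salt-managed, marked with a tag so panels/alerts can filter on it
[[inputs.ping]]
  urls = ["dist.suse.de"]
  [inputs.ping.tags]
    external = "true"
EOF
systemctl restart telegraf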

Actions #6

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1027 as a follow-up to only include metrics with that tag. Manually verified by looking at the panel myself; it looks good now, as I can no longer find any machine not managed by salt in that panel. I will track it a little longer because an alert is currently firing and I want to check whether my changes broke something or actually fixed that alert.
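
For reference, a quick way to cross-check on the monitoring host that only tagged targets remain in the data the panel queries could be something like this (database name and tag name are assumptions):

influx -database telegraf -execute \
  "SHOW TAG VALUES FROM \"ping\" WITH KEY = \"url\" WHERE \"external\" = 'true'"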

Actions #7

Updated by okurz about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1027 merged. I can confirm that the monitoring data looks good. Regarding the firing alert that you mentioned: if I interpret https://monitor.qa.suse.de/alerting/grafana/2Z025iB4km/view?returnTo=%2Fd%2FEML0bpuGk%2Fmonitoring%3ForgId%3D1%26viewPanel%3D4%26editPanel%3D4%26tab%3Dalert correctly, then "openqa - ada.qe.suse.de" may still be included. Do you have an idea why that might still be the case?

Actions #8

Updated by nicksinger about 1 year ago

  • Status changed from Feedback to In Progress

Checking if we have an old alert instance and how to get rid of it. I adjusted the alert, but I remember we had some problems with old alert definitions dangling in the grafana database.
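
A sketch of how such leftovers can be inspected directly in grafana's database on the monitoring host, using the table names that also show up in the later comments (exact columns may differ between grafana versions):

sqlite3 /var/lib/grafana/grafana.db \
  "select uid, title from alert_rule where title like '%Packet loss between worker hosts%';
   select rule_uid, version from alert_rule_version where rule_uid = '2Z025iB4km';"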

Actions #9

Updated by openqa_review about 1 year ago

  • Due date set to 2023-11-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Feedback

I followed our documented steps in https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts and ran sqlite3 /var/lib/grafana/grafana.db "delete from alert_rule where uid = '2Z025iB4km'; delete from alert_rule_version where rule_uid = '2Z025iB4km'; delete from provenance_type where record_key = '2Z025iB4km'; delete from annotation where text like '%2Z025iB4km%';". After a restart of grafana we now have the same UID back, but it is clearly labeled as "This alert rule has been provisioned" and in state "Normal". I will give it some time to see if it (wrongly) fires again.
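
For reference, one way to verify afterwards that the rule came back as a provisioned one is to look at its provenance record again (a sketch; the column layout may differ between grafana versions):

sqlite3 /var/lib/grafana/grafana.db \
  "select * from provenance_type where record_key = '2Z025iB4km';"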

Actions #11

Updated by nicksinger about 1 year ago

  • Status changed from Feedback to In Progress
Actions #12

Updated by nicksinger about 1 year ago

The alert came back with wrong values, and the query in the alert does not correspond to what is shown when clicking "Show in explore". I will try to delete the provisioning file for that alert, run the delete command on the database again and see if the alert comes back, in which case we might be missing a reference somewhere else.
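
Roughly the plan described above as shell steps (the provisioning file path is a placeholder; the delete statements are the ones from the documented procedure in #10):

rm /etc/grafana/provisioning/alerting/<file-containing-that-alert>.yaml
sqlite3 /var/lib/grafana/grafana.db \
  "delete from alert_rule where uid = '2Z025iB4km';
   delete from alert_rule_version where rule_uid = '2Z025iB4km';
   delete from provenance_type where record_key = '2Z025iB4km';
   delete from annotation where text like '%2Z025iB4km%';"
systemctl restart grafana-server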

Actions #13

Updated by nicksinger about 1 year ago

After deleting the alert and checking again without provisioning, I found just the following remaining reference in the database:

root@monitor:/var/lib/grafana # sqlite3 grafana.db .dump | grep -i "Packet loss between worker hosts and other hosts alert" | grep -v "annotation"
INSERT INTO alert_instance VALUES(1,'2Z025iB4km','[["__alert_rule_namespace_uid__","fMCMfVKWz"],["__alert_rule_uid__","2Z025iB4km"],["__contacts__","\"osd-admins\""],["alertname","Packet loss between worker hosts and other hosts alert"],["grafana_folder","Salt"],["rule_uid","2Z025iB4km"]]','1611fe769c72086129867d2f1291ddbbc515d8cd','Alerting',1700064120,1700230020,1700230200,'');

I wonder why this is not mentioned in our docs (it seems like "alert_instance" is the main table containing alerts).
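
Following that finding, the additional cleanup not covered by the documented steps would presumably be something like this, assuming the second value in the dump above is the rule_uid column:

sqlite3 /var/lib/grafana/grafana.db \
  "delete from alert_instance where rule_uid = '2Z025iB4km';"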

Actions #14

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Feedback

The experiment from comment #13 did not work. However, while I was at it, I created a new alert with the query as I would expect it and realized that grafana not only adjusts the "query" parameter in the yaml but also adds a list of "tags" below it. This apparently makes a difference when evaluating the query inside the alert itself. I proposed a fix in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1045

Lesson learned: don't just adjust the query to match the panel query; rather, make a copy of the alert and replace the original alert with the generated yaml of the new, adjusted alert.
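
One way to get grafana's own generated representation of a manually created rule for such a copy, instead of hand-editing the provisioned yaml, could be the alerting provisioning API (assuming a sufficiently recent grafana and a valid API token; the jq filter only illustrates where "query" and "tags" end up in the rule model):

curl -s -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  "https://monitor.qa.suse.de/api/v1/provisioning/alert-rules/<uid-of-new-rule>" \
  | jq '.data[].model | {query, tags}'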

Actions #15

Updated by okurz about 1 year ago

  • Due date deleted (2023-11-29)
  • Status changed from Feedback to Resolved

Cool. https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1045 merged and deployed. https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4 is quicker to load now, shows only relevant entries and there is a green heart regarding alerting. I removed the corresponding silence with alertname=Packet loss between worker hosts and other … as it showed that it does not currently silence any alert. I guess we can resolve this now.
