action #128870
[alert] DatasourceError for "InfluxDB not reachable" size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status: Blocked
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-05-07
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

See http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=2

Reproducible

Every time the monitoring host reboots, e.g. during the weekly reboot maintenance window on Sunday morning.

Suggestions

Rollback actions

  • remove silence for "InfluxDB not reachable"
Actions #1

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Judging by the state history, treating "ExecutionError" as OK would get rid of the mail. However, we would then supposedly no longer be alerted in case InfluxDB stays down. I'm wondering whether one can configure a grace period for this.

Actions #2

Updated by mkittler about 1 year ago

There seems to be some grace period in place, e.g. we've got "Pending (error)" on 2023-04-23 03:40:58 and then "Normal (nodata)"/"Normal (missingseries)"¹ three minutes later on 2023-04-23 03:43:50 with no "Error" in between. The same was also the case the week before and the week before that.

After that week, on 2023-04-30 03:38:44, there was just "Error" with no preceding "Pending (error)", as if there were no grace period at all anymore. I suppose I'll have to find out what changed between 2023-04-23 03:40:58 and 2023-04-30 03:38:4.

The state history also looks weirdly broken. Apparently some parts of the internally used JSON cannot be parsed by the frontend, so the data is torn into two distinct tables. Maybe something worth reporting upstream.


¹ Not sure what the difference between those two is. Note that having no data is actually normal for this dummy query, as it is really just about alerting in the error case. So configuring execErrState: OK would render it completely useless; if we wanted that, we should remove the alert entirely.
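For reference, the fields discussed here (execErrState, noDataState and the "For" grace period) map onto Grafana's file-provisioning schema for alert rules roughly as follows. This is only a sketch; the UID, folder, group name and datasource UID are made up, and the actual rule on the monitoring host may differ:

```yaml
# Hypothetical provisioned alert rule illustrating the fields under discussion.
apiVersion: 1
groups:
  - orgId: 1
    name: monitoring          # assumed group name
    folder: alerts            # assumed folder
    interval: 10s             # evaluation interval ("Evaluate every 10s")
    rules:
      - uid: influxdb-not-reachable   # hypothetical UID
        title: InfluxDB not reachable
        condition: A
        for: 20m              # grace period: stay "Pending" this long before firing
        execErrState: Error   # Error | OK | Alerting; OK would suppress the mail,
                              # but then a permanently down InfluxDB never alerts
        noDataState: OK       # no data is normal for this dummy query
        data:
          - refId: A
            datasourceUid: influxdb   # hypothetical datasource UID
            model: {}
```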

Actions #3

Updated by mkittler about 1 year ago

Judging by the output of openqa-monitor:/home/martchus # cat /var/log/zypp/history | grep -i grafana, Grafana 9.4.7 had already been installed more than a week before 04-23. So the service was already running 9.4.7 while the issue was not yet present (at least not within the time frame covered by the state history).

Actions #4

Updated by okurz about 1 year ago

Regarding a configurable grace period for execution errors, https://github.com/grafana/grafana/issues/55320 should help; it was closed 3 weeks ago, but it's unclear in which release this will show up. https://github.com/grafana/grafana/pull/65574 says it's part of milestone 9.5.0. https://build.opensuse.org/package/show/openSUSE:Factory/grafana is already at 9.5.1. https://build.opensuse.org/package/show/devel:openQA:monitoring/grafana looks kinda broken with source archives for both 9.4.7 and 9.5.1 included.

Actions #5

Updated by okurz about 1 year ago

I called osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring but the package sources still show 9.4.7. Maybe we need to remove and then recreate the link. In https://build.opensuse.org/package/live_build_log/devel:openQA:monitoring/grafana/15.4/x86_64 I still saw it building 9.4.7 so I called

osc rdelete -m "Delete broken old 9.4.7 package (poo#128870)" devel:openQA:monitoring grafana && osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring

EDIT: As soon as the package is built and deployed on monitor.qa.suse.de this should automatically solve the problem as the release notes state:

Grafana Alerting rules with NoDataState configuration set to Alerting will now respect "For" duration.

EDIT: https://monitor.qa.suse.de/ now has grafana 9.5.1 which should have the fix for this ticket.

EDIT: I triggered a reboot of monitor.qa.suse.de, but that still triggered an alert in the period https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=2&orgId=1&editPanel=2&tab=alert&from=1683656412889&to=1683656698684 . The alert is currently configured to "Evaluate Every 10s For 20m". How about we try changing that to "Evaluate every 20m for 60m"? The description in https://github.com/grafana/grafana/issues/55320#issue-1376161291 brought me to another idea: configuring a special notification policy. On https://monitor.qa.suse.de/alerting/routes?tab=notification_policies# I created a new nested policy below the "osd-admins" one: alertname = InfluxDB not reachable, delivered to osd-admins, wait 20m to group instances. I rebooted 4 times and no alert triggered. Maybe that helped?
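The nested notification policy created in the UI above could be expressed in Grafana's file-provisioning format for notification policies roughly as below. This is a sketch only; the receiver name comes from the comment, but the exact provisioning layout on the monitoring host is an assumption:

```yaml
# Hypothetical provisioned notification policy matching the UI settings above.
apiVersion: 1
policies:
  - orgId: 1
    receiver: osd-admins          # default/parent receiver
    routes:
      - receiver: osd-admins      # "Delivered to osd-admins"
        object_matchers:
          - ['alertname', '=', 'InfluxDB not reachable']
        group_wait: 20m           # "Wait 20m to group instances"
```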

Actions #6

Updated by openqa_review about 1 year ago

  • Due date set to 2023-05-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz about 1 year ago

I suggest you crosscheck what I did. If you agree, you can resolve the ticket, assuming that we have fixed the problem; if not, we will be informed :)

Actions #8

Updated by okurz about 1 year ago

  • Due date deleted (2023-05-24)
  • Status changed from In Progress to Resolved

we verified together

Actions #9

Updated by okurz about 1 year ago

  • Status changed from Resolved to Workable
  • Assignee deleted (mkittler)

The alert returned, so we should revisit. How about we silence the alert for longer and wait for further upstream changes? AFAIK there is at least one related upstream issue still open that we can track.

Actions #10

Updated by okurz about 1 year ago

  • Subject changed from [alert] DatasourceError for "InfluxDB not reachable" to [alert] DatasourceError for "InfluxDB not reachable" size:M
  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

estimated, created silence, tracking upstream https://github.com/grafana/grafana/issues/16290
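A silence like the one created here can also be expressed as a payload for Grafana's Alertmanager-compatible silences endpoint (POST /api/alertmanager/grafana/api/v2/silences). The timestamps below are placeholders, not the actual silence window:

```json
{
  "matchers": [
    { "name": "alertname", "value": "InfluxDB not reachable", "isRegex": false }
  ],
  "startsAt": "2023-06-01T00:00:00Z",
  "endsAt": "2024-06-01T00:00:00Z",
  "createdBy": "okurz",
  "comment": "Silenced while tracking upstream https://github.com/grafana/grafana/issues/16290 (poo#128870)"
}
```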

Actions #11

Updated by jbaier_cz about 1 year ago

Maybe the silence does not cover all use cases? We have another alert.

Actions #12

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress
Actions #13

Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved
Actions #15

Updated by nicksinger 10 months ago

  • Status changed from Resolved to Workable

The silence is still in place and the upstream issue is still "open", but this ticket is resolved while being used as the reference for the silence, so something is off here.

Actions #16

Updated by okurz 10 months ago

  • Status changed from Workable to Blocked
  • Target version changed from Ready to future

True. So tracking outside the backlog as blocked on https://github.com/grafana/grafana/issues/16290

Actions #17

Updated by mkittler 3 months ago

Some progress was made upstream: https://github.com/grafana/grafana/issues/16290#issuecomment-2100141960

So I guess we should wait for the next Grafana release, update the packaging, and see whether it helps.
