action #128870
[alert] DatasourceError for "InfluxDB not reachable" size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status: Blocked
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-05-07
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

See http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=2

Reproducible

Every time the monitoring host reboots, e.g. during the weekly reboot maintenance window on Sunday morning.

Suggestions

Rollback actions

  • remove silence for "InfluxDB not reachable"
Actions #1

Updated by mkittler about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Judging by the state history, treating "ExecutionError" as OK would get rid of the mail. However, we would then supposedly no longer be alerted in case InfluxDB stays down. I'm wondering whether one can configure a grace period for this.

Actions #2

Updated by mkittler about 1 year ago

There seems to be some grace period in place, e.g. we've got "Pending (error)" on 2023-04-23 03:40:58 and then "Normal (nodata)"/"Normal (missingseries)"¹ three minutes later on 2023-04-23 03:43:50 with no "Error" in between. The same was also the case the week before and the week before that.

After that week, on 2023-04-30 03:38:44, there was just "Error" with no preceding "Pending (error)", as if there were no grace period at all anymore. I suppose I'll have to find out what changed between 2023-04-23 03:40:58 and 2023-04-30 03:38:4.

The state history also looks weirdly broken. Apparently some parts of the internally used JSON cannot be parsed by the frontend, so the data is torn into two distinct tables. Maybe something worth reporting upstream.


¹ Not sure what the difference between those two is. Note that having no data is actually normal for this dummy query, as it is really just about alerting in the error case. So configuring execErrState: OK would render it completely useless; if we wanted that, we should remove the alert entirely.
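For reference, the fields discussed here (execErrState, noDataState and the "For" grace period) map onto Grafana's file-provisioning schema for alert rules roughly as follows. This is only a sketch; the UID, folder, group name and datasource UID are made up, and the actual rule on the monitoring host may differ:

```yaml
# Hypothetical provisioned alert rule illustrating the fields under discussion.
apiVersion: 1
groups:
  - orgId: 1
    name: monitoring          # assumed group name
    folder: alerts            # assumed folder
    interval: 10s             # evaluation interval ("Evaluate every 10s")
    rules:
      - uid: influxdb-not-reachable   # hypothetical UID
        title: InfluxDB not reachable
        condition: A
        for: 20m              # grace period: stay "Pending" this long before firing
        execErrState: Error   # Error | OK | Alerting; OK would suppress the mail,
                              # but then a permanently down InfluxDB never alerts
        noDataState: OK       # no data is normal for this dummy query
        data:
          - refId: A
            datasourceUid: influxdb   # hypothetical datasource UID
            model: {}
```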

Actions #3

Updated by mkittler about 1 year ago

Judging by the output of openqa-monitor:/home/martchus # cat /var/log/zypp/history | grep -i grafana, Grafana 9.4.7 had already been installed more than a week before 04-23. So the service was already running 9.4.7 while the issue was not yet present (at least not within the time frame covered by the state history).

Actions #4

Updated by okurz about 1 year ago

Regarding a configurable grace period for execution errors, https://github.com/grafana/grafana/issues/55320 should help; it was closed 3 weeks ago, but it's unclear in which release this will show up. https://github.com/grafana/grafana/pull/65574 says it's part of milestone 9.5.0. https://build.opensuse.org/package/show/openSUSE:Factory/grafana is already at 9.5.1. https://build.opensuse.org/package/show/devel:openQA:monitoring/grafana looks kinda broken with source archives for both 9.4.7 and 9.5.1 included.

Actions #5

Updated by okurz about 1 year ago

I called osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring but the package sources still show 9.4.7. Maybe we need to remove and then recreate the link. In https://build.opensuse.org/package/live_build_log/devel:openQA:monitoring/grafana/15.4/x86_64 I still saw it building 9.4.7 so I called

osc rdelete -m "Delete broken old 9.4.7 package (poo#128870)" devel:openQA:monitoring grafana && osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring

EDIT: As soon as the package is built and deployed on monitor.qa.suse.de this should automatically solve the problem as the release notes state:

Grafana Alerting rules with NoDataState configuration set to Alerting will now respect "For" duration.

EDIT: https://monitor.qa.suse.de/ now has grafana 9.5.1 which should have the fix for this ticket.

EDIT: I triggered a reboot of monitor.qa.suse.de, but that still triggered an alert in the period https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=2&orgId=1&editPanel=2&tab=alert&from=1683656412889&to=1683656698684 . The alert is currently configured to "Evaluate Every 10s For 20m". How about we try changing that to "Evaluate every 20m for 60m"? The description in https://github.com/grafana/grafana/issues/55320#issue-1376161291 brought me to another idea: configuring a special notification policy. On https://monitor.qa.suse.de/alerting/routes?tab=notification_policies# I created a new nested policy below the "osd-admins" one: alertname = InfluxDB not reachable, delivered to osd-admins, wait 20m to group instances. I rebooted 4 times and no alert triggered. Maybe that helped?
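The nested notification policy created in the UI above could be expressed in Grafana's file-provisioning format for notification policies roughly as below. This is a sketch only; the receiver name comes from the comment, but the exact provisioning layout on the monitoring host is an assumption:

```yaml
# Hypothetical provisioned notification policy matching the UI settings above.
apiVersion: 1
policies:
  - orgId: 1
    receiver: osd-admins          # default/parent receiver
    routes:
      - receiver: osd-admins      # "Delivered to osd-admins"
        object_matchers:
          - ['alertname', '=', 'InfluxDB not reachable']
        group_wait: 20m           # "Wait 20m to group instances"
```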

Actions #6

Updated by openqa_review about 1 year ago

  • Due date set to 2023-05-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz about 1 year ago

I suggest you crosscheck what I did. If you agree, you can resolve the ticket, assuming that we have fixed the problem; if not, we will be informed :)

Actions #8

Updated by okurz about 1 year ago

  • Due date deleted (2023-05-24)
  • Status changed from In Progress to Resolved

we verified together

Actions #9

Updated by okurz about 1 year ago

  • Status changed from Resolved to Workable
  • Assignee deleted (mkittler)

The alert returned, so we should revisit. How about we silence the alert for longer and wait for further upstream changes? AFAIK there is at least one related upstream issue still open that we can track.

Actions #10

Updated by okurz about 1 year ago

  • Subject changed from [alert] DatasourceError for "InfluxDB not reachable" to [alert] DatasourceError for "InfluxDB not reachable" size:M
  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

estimated, created silence, tracking upstream https://github.com/grafana/grafana/issues/16290
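A silence like the one created here can also be expressed as a payload for Grafana's Alertmanager-compatible silences endpoint (POST /api/alertmanager/grafana/api/v2/silences). The timestamps below are placeholders, not the actual silence window:

```json
{
  "matchers": [
    { "name": "alertname", "value": "InfluxDB not reachable", "isRegex": false }
  ],
  "startsAt": "2023-06-01T00:00:00Z",
  "endsAt": "2024-06-01T00:00:00Z",
  "createdBy": "okurz",
  "comment": "Silenced while tracking upstream https://github.com/grafana/grafana/issues/16290 (poo#128870)"
}
```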

Actions #11

Updated by jbaier_cz about 1 year ago

Maybe the silence does not cover all use cases? We have another alert.

Actions #12

Updated by okurz about 1 year ago

  • Status changed from Blocked to In Progress
Actions #13

Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved
Actions #15

Updated by nicksinger 10 months ago

  • Status changed from Resolved to Workable

The silence is still in place and the upstream issue is still "open", but this ticket is resolved while being used as the reference for the silence, so something is off here.

Actions #16

Updated by okurz 10 months ago

  • Status changed from Workable to Blocked
  • Target version changed from Ready to future

True. So tracking outside the backlog as blocked on https://github.com/grafana/grafana/issues/16290

Actions #17

Updated by mkittler 3 months ago

Some progress was made upstream: https://github.com/grafana/grafana/issues/16290#issuecomment-2100141960

So I guess we should wait for the next Grafana release, update the packaging, and see whether it helps.
