action #128870
open[alert] DatasourceError for "InfluxDB not reachable" size:M
0%
Description
Observation
See http://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk?orgId=1&viewPanel=2
Reproducible
every time the monitoring host reboots, e.g. during the weekly reboot maintenance window on Sunday morning
Suggestions
- We already configured all other alerts to not alert on no data or execution error. Only for this specific alert we did not do that, so that we are actually alerted if no data is arriving in InfluxDB. Maybe we can still choose to treat "ExecutionError" as ok and only alert on NoDataError
- Crosscheck the state history in https://stats.openqa-monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=2&from=1684012870771&to=1684068285405&editPanel=2&tab=alert# . It seems that in 2023-04 the alert worked fine with the "pending" state
- track upstream issue https://github.com/grafana/grafana/issues/16290 which is still open
Rollback actions
- remove silence for "InfluxDB not reachable"
Updated by mkittler over 1 year ago
- Status changed from New to In Progress
- Assignee set to mkittler
Judging by the state history, treating "ExecutionError" as ok would get rid of the mail. Supposedly we'd then no longer be alerted in case InfluxDB stays down. I'm wondering whether one can configure a grace period for this.
Updated by mkittler over 1 year ago
There seems to be some grace period in place, e.g. we've got "Pending (error)" on 2023-04-23 03:40:58 and then "Normal (nodata)"/"Normal (missingseries)"¹ three minutes later on 2023-04-23 03:43:50 with no "Error" in between. The same was also the case the week before and another week before.
After that week, on 2023-04-30 03:38:44, there was just "Error" with no preceding "Pending (error)", as if there was no grace period at all anymore. I suppose I'll have to find out what changed between 2023-04-23 03:40:58 and 2023-04-30 03:38:44.
The state history also looks weirdly broken. Apparently some parts of the internally used JSON cannot be parsed by the frontend, so the data is torn into two distinct tables. Maybe something worth reporting upstream.
¹ Not sure what the difference between those two is. Note that having no data is actually normal for this dummy query as it is really just about alerting in the error case. So configuring execErrState: OK would mean the alert would be completely useless, and if we wanted that we should remove it completely.
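To make the expected grace-period behaviour concrete, here is a minimal sketch of how a "For" duration would gate the transition from "Pending (error)" to "Error". This is a simplified model for illustration only, not Grafana's actual implementation; the names RuleState and evaluate are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RuleState:
    """Hypothetical, simplified state of one alert rule."""
    state: str = "Normal"
    failing_since: Optional[datetime] = None


def evaluate(rule: RuleState, now: datetime, query_failed: bool,
             for_duration: timedelta) -> str:
    """Update the rule for one evaluation and return the new state."""
    if not query_failed:
        # Query works again: reset the grace period.
        rule.state = "Normal"
        rule.failing_since = None
        return rule.state
    if rule.failing_since is None:
        rule.failing_since = now
    # Only escalate once the failure has lasted for the whole "For" duration.
    if now - rule.failing_since >= for_duration:
        rule.state = "Error"
    else:
        rule.state = "Pending (error)"
    return rule.state


start = datetime(2023, 4, 23, 3, 40, 58)
rule = RuleState()
print(evaluate(rule, start, True, timedelta(minutes=20)))
print(evaluate(rule, start + timedelta(minutes=3), False, timedelta(minutes=20)))
# Without any grace period the very first failure escalates directly.
print(evaluate(RuleState(), start, True, timedelta(0)))
```

With a 20 minute grace period a three minute outage only passes through "Pending (error)", which matches what the state history shows for 2023-04-23. With no grace period the first failed evaluation goes straight to "Error", which is what the 2023-04-30 entry looks like.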
Updated by mkittler over 1 year ago
Judging by the output of openqa-monitor:/home/martchus # cat /var/log/zypp/history | grep -i grafana
Grafana version 9.4.7 had already been installed more than a week before 04-23. So the service was already running 9.4.7 when the issue was not yet present (at least not within the time frame we know the state history of).
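The same check could be scripted instead of grepping by hand. The following is a rough sketch, assuming the usual pipe-separated layout of /var/log/zypp/history (timestamp|action|name|version|...). The sample lines are invented for illustration and are not the real log from the monitoring host; package_installs is a hypothetical helper.

```python
from datetime import datetime

# Invented sample lines in the assumed zypp history layout, NOT the real log.
SAMPLE_HISTORY = """\
# comment lines start with a hash
2023-04-12 03:31:07|install|grafana|9.4.7-1.1|x86_64||example_repo|abc123|
2023-04-12 03:31:09|install|influxdb|1.8.10-1.1|x86_64||example_repo|def456|
2023-05-09 18:02:44|install|grafana|9.5.1-1.1|x86_64||example_repo|0123ab|
"""


def package_installs(history_text, package):
    """Return (timestamp, version) tuples for every install of a package."""
    installs = []
    for line in history_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) < 4 or fields[1] != "install" or fields[2] != package:
            continue
        when = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        installs.append((when, fields[3]))
    return installs


for when, version in package_installs(SAMPLE_HISTORY, "grafana"):
    print(when.isoformat(), version)
```

Comparing the returned timestamps against the first "Error" entry in the state history shows whether a package update falls between the last good and the first bad reboot.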
Updated by okurz over 1 year ago
Regarding configurable grace period for execution error https://github.com/grafana/grafana/issues/55320 should help, closed 3 weeks ago but it's unclear in which release this should show up. https://github.com/grafana/grafana/pull/65574 says it's part of milestone 9.5.0. https://build.opensuse.org/package/show/openSUSE:Factory/grafana is already 9.5.1 . https://build.opensuse.org/package/show/devel:openQA:monitoring/grafana looks kinda broken with source archives for both 9.4.7 and 9.5.1 included.
Updated by okurz over 1 year ago
I called osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring
but the package sources still show 9.4.7. Maybe we need to remove and then recreate the link. In https://build.opensuse.org/package/live_build_log/devel:openQA:monitoring/grafana/15.4/x86_64 I still saw it building 9.4.7 so I called
osc rdelete -m "Delete broken old 9.4.7 package (poo#128870)" devel:openQA:monitoring grafana && osc linkpac --force openSUSE:Factory grafana devel:openQA:monitoring
EDIT: As soon as the package is built and deployed on monitor.qa.suse.de this should automatically solve the problem as the release notes state:
Grafana Alerting rules with NoDataState configuration set to Alerting will now respect "For" duration.
EDIT: https://monitor.qa.suse.de/ now has grafana 9.5.1 which should have the fix for this ticket.
EDIT: I triggered a reboot of monitor.qa.suse.de but that still triggered an alert in the period https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?viewPanel=2&orgId=1&editPanel=2&tab=alert&from=1683656412889&to=1683656698684 . The alert is currently configured to "Evaluate Every 10s For 20m". How about we try to change that to "Evaluate every 20m for 60m"? The description in https://github.com/grafana/grafana/issues/55320#issue-1376161291 brought me to another idea: configuring a special notification policy. On https://monitor.qa.suse.de/alerting/routes?tab=notification_policies# I created a new nested policy below the "osd-admins" one: alertname = InfluxDB not reachable, Delivered to osd-admins, Wait 20m to group instances. Rebooted 4 times and no alert triggered. Maybe that helped?
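A rough way to reason about why the "Wait 20m to group instances" setting might suppress the reboot alerts: in a simplified model, the first notification for a group is only sent once the group wait has expired, and nothing is sent if the alert has already resolved by then. The sketch below encodes only that assumption; it is not Grafana's code, should_notify is a hypothetical helper, and the five minute outage is an invented example value.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(fired_at: datetime, resolved_at: Optional[datetime],
                  group_wait: timedelta) -> bool:
    """True if the alert is still firing when the group wait expires."""
    flush_at = fired_at + group_wait
    return resolved_at is None or resolved_at > flush_at


fired = datetime(2023, 5, 9, 18, 20, 0)
reboot_outage = fired + timedelta(minutes=5)  # invented example duration

# Short outage during a reboot: resolved before the 20m wait is over.
print(should_notify(fired, reboot_outage, timedelta(minutes=20)))
# InfluxDB stays down: the alert is still firing after the wait.
print(should_notify(fired, None, timedelta(minutes=20)))
```

Under this model a reboot-sized outage stays silent while a persistent outage would still notify, only 20 minutes later than before.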
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
I suggest you crosscheck what I did and if you agree you can resolve the ticket assuming that we have fixed the problem and if not then we will be informed :)
Updated by okurz over 1 year ago
- Due date deleted (2023-05-24)
- Status changed from In Progress to Resolved
we verified together
Updated by okurz over 1 year ago
- Status changed from Resolved to Workable
- Assignee deleted (mkittler)
Alert returned, we should revisit. How about we silence the alert for longer and wait for further upstream changes? AFAIK there is at least one related upstream issue still open that we can track.
Updated by okurz over 1 year ago
- Subject changed from [alert] DatasourceError for "InfluxDB not reachable" to [alert] DatasourceError for "InfluxDB not reachable" size:M
- Description updated (diff)
- Status changed from Workable to Blocked
- Assignee set to okurz
estimated, created silence, tracking upstream https://github.com/grafana/grafana/issues/16290
Updated by jbaier_cz over 1 year ago
Maybe the silence does not cover all use-cases? We have another alert
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
so maybe https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/865 will be enough
Updated by nicksinger over 1 year ago
- Status changed from Resolved to Workable
The silence is still in place and the upstream issue is still "open", but this ticket is resolved while being used as the reference for the silence, so something is off here.
Updated by okurz over 1 year ago
- Status changed from Workable to Blocked
- Target version changed from Ready to future
True. So tracking outside the backlog as blocked on https://github.com/grafana/grafana/issues/16290
Updated by mkittler 8 months ago
Some progress was made upstream: https://github.com/grafana/grafana/issues/16290#issuecomment-2100141960
So I guess we should wait for the next Grafana release and update the packaging and see whether it helps.
Updated by okurz 6 months ago
- Priority changed from High to Low
I couldn't identify which release on https://github.com/grafana/grafana/releases would include the relevant fix. But regardless, https://build.opensuse.org/projects/server:monitoring/packages/grafana/files/grafana.changes?expand=1 only shows changes up to 2024-04, so there is no chance of getting a fix with releases this old. We need to wait until somebody is able to update grafana.