action #128870
open
[alert] DatasourceError for "InfluxDB not reachable" size:M
Added by okurz 12 months ago.
Updated 8 months ago.
- Status changed from New to In Progress
- Assignee set to mkittler
Judging by the state history, treating "ExecutionError" as ok would get rid of the mail. Presumably we would then no longer be alerted in case InfluxDB stays down. I'm wondering whether one can configure a grace period for this.
There seems to be some grace period in place, e.g. we've got "Pending (error)" on 2023-04-23 03:40:58 and then "Normal (nodata)"/"Normal (missingseries)"¹ three minutes later on 2023-04-23 03:43:50 with no "Error" in between. The same was also the case the week before and another week before.
After that week, on 2023-04-30 03:38:44, there was just "Error" with no preceding "Pending (error)", as if there was no grace period at all anymore. I suppose I'll have to find out what changed between 2023-04-23 03:40:58 and 2023-04-30 03:38:44.
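The comparison above can be sketched as a small check over state-history entries: flag every "Error" that was not directly preceded by a "Pending (error)". This is only an illustration, not something Grafana provides. The function name `errors_without_pending`, the `(timestamp, state)` tuple layout and the 10-minute window are assumptions made for this sketch; the three sample entries are the ones quoted in this comment.

```python
from datetime import datetime, timedelta

# State-history entries quoted in this comment, as (timestamp, state) tuples.
# The tuple layout is an assumption; Grafana does not export this format.
HISTORY = [
    ("2023-04-23 03:40:58", "Pending (error)"),
    ("2023-04-23 03:43:50", "Normal (nodata)"),
    ("2023-04-30 03:38:44", "Error"),
]


def errors_without_pending(history, window=timedelta(minutes=10)):
    """Return timestamps of "Error" entries that were not directly preceded
    by a "Pending (error)" entry within `window` (hypothetical default)."""
    entries = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), state)
        for ts, state in history
    )
    flagged = []
    previous = None  # the entry immediately before the current one
    for when, state in entries:
        if state == "Error":
            preceded = (
                previous is not None
                and previous[1] == "Pending (error)"
                and when - previous[0] <= window
            )
            if not preceded:
                flagged.append(when)
        previous = (when, state)
    return flagged


if __name__ == "__main__":
    for when in errors_without_pending(HISTORY):
        print(f"Error without preceding pending state at {when}")
```

Run against the entries above, only the 2023-04-30 entry is flagged, which matches the observation that the grace period seems to have disappeared between those two dates.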
The state history also looks weirdly broken. Apparently some parts of the internally used JSON cannot be parsed by the frontend, so the data is torn into two distinct tables. Maybe something worth reporting upstream.
¹ Not sure what the difference between those two is. Note that having no data is actually normal for this dummy query as it is really just about alerting in the error case. So configuring execErrState: OK would make the alert completely useless, and if we wanted that we should remove it completely.
Judging by the output of
openqa-monitor:/home/martchus # cat /var/log/zypp/history | grep -i grafana
Grafana version 9.4.7 had already been installed more than a week before 04-23. So the service was already running 9.4.7 while the issue was not yet present (at least not within the time frame for which we know the state history).
- Due date set to 2023-05-24
Setting due date based on mean cycle time of SUSE QE Tools
I suggest you crosscheck what I did and if you agree you can resolve the ticket assuming that we have fixed the problem and if not then we will be informed :)
- Due date deleted (2023-05-24)
- Status changed from In Progress to Resolved
- Status changed from Resolved to Workable
- Assignee deleted (mkittler)
Alert returned, we should revisit. How about we silence the alert for longer and wait for further upstream changes? AFAIK there is at least one related upstream issue still open that we can track.
- Subject changed from [alert] DatasourceError for "InfluxDB not reachable" to [alert] DatasourceError for "InfluxDB not reachable" size:M
- Description updated (diff)
- Status changed from Workable to Blocked
- Assignee set to okurz
Maybe the silence does not cover all use-cases? We have another alert
- Status changed from Blocked to In Progress
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
- Status changed from Resolved to Workable
The silence is still in place and the upstream issue is still "open", but this ticket is resolved while still being used as the reference for the silence, so something is off here.
- Status changed from Workable to Blocked
- Target version changed from Ready to future