action #182732

open

JSON unmarshal errors for multiple grafana alerts during provisioning size:M

Added by robert.richardson 12 days ago. Updated 5 days ago.

Status: In Progress
Priority: High
Category: -
Target version:
Start date:
Due date: 2025-06-06 (Due in 5 days)
% Done: 0%
Estimated time:
Description

Observation

While working on #181703 I noticed that the Grafana logs are being spammed with JSON unmarshal errors for multiple rule_uids.

openqa-monitor.qa.suse.de:/var/log/grafana/grafana.log, excerpt (more rule_uids are affected):

...
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965700132Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965831101Z level=error msg="Failed to evaluate rule" attempt=2 error="server side expressions pipeline returned an error: failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 version=9579 fingerprint=c5a55fb56f1bf301 now=2025-05-19T12:00:10Z rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 t=2025-05-19T12:00:17.493038787Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 version=9561 fingerprint=7f79f58aa93b4598 now=2025-05-19T12:00:10Z rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 t=2025-05-19T12:00:17.902913628Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"

When opening an affected rule in the web UI (by copying an affected rule_uid and pasting it into the appropriate section of the URL), the error is also displayed, e.g. see
https://monitor.qa.suse.de/alerting/grafana/b32d89067464d09c7395f0ec7badb560a9bd0a9e/view

Note that the graph is not populated with data.

These errors also appear in the oldest available logs (2025-05-12), so it is unclear when the issue originally started.

This also results in the logs growing by up to ~500 MB per day, although since we only keep logs for a week the file size is currently not a big issue.

rrichardson@monitor:~> sudo ls -lh /var/log/grafana
total 1.9G
-rw-r----- 1 grafana grafana  99M May 20 09:15 grafana.log
-rw-r----- 1 grafana grafana 257M May 13 23:46 grafana.log.2025-05-13.002
-rw-r----- 1 grafana grafana 2.5M May 13 23:59 grafana.log.2025-05-14.001
-rw-r----- 1 grafana grafana 257M May 14 23:50 grafana.log.2025-05-14.002
-rw-r----- 1 grafana grafana 1.7M May 14 23:59 grafana.log.2025-05-15.001
-rw-r----- 1 grafana grafana 256M May 15 23:59 grafana.log.2025-05-16.001
-rw-r----- 1 grafana grafana 257M May 16 23:58 grafana.log.2025-05-16.002
-rw-r----- 1 grafana grafana 242K May 16 23:59 grafana.log.2025-05-17.001
-rw-r----- 1 grafana grafana 255M May 17 23:59 grafana.log.2025-05-18.001
-rw-r----- 1 grafana grafana 256M May 18 23:59 grafana.log.2025-05-19.001
-rw-r----- 1 grafana grafana 257M May 19 23:57 grafana.log.2025-05-19.002
-rw-r----- 1 grafana grafana 489K May 19 23:59 grafana.log.2025-05-20.001

Acceptance criteria

  • AC1: No obvious error messages in /var/log/grafana/grafana.log about "unmarshal number"
  • AC2: We still have all (or reasonable obvious) alerts
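
One possible way to check AC1, and to list every affected rule_uid, is to count the matching log lines per rule. The following is only a minimal sketch: the function name `count_unmarshal_errors` is hypothetical, and it assumes each log entry sits on a single line containing a `rule_uid=<hex>` field as in the excerpt above.

```python
import re
from collections import Counter

# Matches the rule_uid field as it appears in the Grafana log excerpt.
RULE_UID = re.compile(r"rule_uid=([0-9a-f]+)")
# Substring taken from the error message quoted in this ticket.
MARKER = "cannot unmarshal number"


def count_unmarshal_errors(lines):
    """Count 'cannot unmarshal number' log lines per rule_uid."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = RULE_UID.search(line)
        if match:
            # Only the first rule_uid on the line is used; the log repeats it.
            counts[match.group(1)] += 1
    return counts
```

It could be fed an open file object, e.g. `count_unmarshal_errors(open("/var/log/grafana/grafana.log", errors="replace"))`. An empty result for a freshly rotated log would suggest AC1 is met.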

Suggestions

  • Continue with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1397
  • Ideally, try deploying our deployment files on a local Grafana instance and confirm there are no errors. If you really must, try it out in production on monitor.qe.nue2.suse.org and monitor the logs
  • Ensure no such errors in grafana logs on monitor.qe.nue2.suse.org
  • Be smart about applying the same rules to multiple files with multiple lines, e.g. sed replacements or fancy vim commands
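
As an alternative to sed, the same rule could be applied on parsed data. The error message says `evaluator.params` holds a number where a list of floats is expected, so one plausible fix (an assumption, not confirmed in this ticket) is to wrap such scalars in a list. The sketch below uses a hypothetical helper `wrap_scalar_params` on already parsed rule definitions. It shows JSON input only; other file formats would need their own parser.

```python
import json


def wrap_scalar_params(node):
    """Recursively wrap numeric evaluator.params in a list.

    Modifies the data in place and returns the number of values fixed.
    """
    fixed = 0
    if isinstance(node, dict):
        evaluator = node.get("evaluator")
        if isinstance(evaluator, dict):
            params = evaluator.get("params")
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(params, (int, float)) and not isinstance(params, bool):
                evaluator["params"] = [params]
                fixed += 1
        for value in node.values():
            fixed += wrap_scalar_params(value)
    elif isinstance(node, list):
        for item in node:
            fixed += wrap_scalar_params(item)
    return fixed


# Made-up rule fragment for illustration, not taken from the real files.
rule = json.loads('{"conditions": [{"evaluator": {"params": 3, "type": "gt"}}]}')
fixes = wrap_scalar_params(rule)
print(fixes, json.dumps(rule))
```

Values that are already lists are left untouched, so running the helper repeatedly should be harmless.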

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #181703: "QE Tools backlog - stacked" panel does not show data since 2025-04-08T12:00:00Z size:S (Resolved, robert.richardson, 2025-05-02)
