action #182732
Updated by okurz 10 days ago
## Observation
While working on #181703 I noticed that the Grafana logs are being spammed with `JSON unmarshal errors` for multiple rule_uids.
*openqa-monitor.qa.suse.de:/var/log/grafana/grafana.log*, **excerpt (more rule_uids are affected)**:
```
...
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965700132Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965831101Z level=error msg="Failed to evaluate rule" attempt=2 error="server side expressions pipeline returned an error: failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 version=9579 fingerprint=c5a55fb56f1bf301 now=2025-05-19T12:00:10Z rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 t=2025-05-19T12:00:17.493038787Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 version=9561 fingerprint=7f79f58aa93b4598 now=2025-05-19T12:00:10Z rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 t=2025-05-19T12:00:17.902913628Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
```
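To list every affected rule_uid rather than only an excerpt, the log can be summarized with a small script. The following is a minimal sketch, assuming each error line contains the literal text `cannot unmarshal number` and at least one `rule_uid=...` token, as in the lines above. The function name `count_unmarshal_errors` is hypothetical.

```python
import re
from collections import Counter

# Matches the first rule_uid=... token on a log line.
RULE_UID = re.compile(r"rule_uid=([\w-]+)")


def count_unmarshal_errors(lines):
    """Count 'cannot unmarshal number' error lines per rule_uid."""
    counts = Counter()
    for line in lines:
        # Skip anything that is not the unmarshal error from this ticket.
        if "cannot unmarshal number" not in line:
            continue
        match = RULE_UID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle for `/var/log/grafana/grafana.log` can be passed directly, and `counts.most_common()` then lists the noisiest rules first.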
When opening an affected rule in the web UI (by copying an affected rule_uid and pasting it into the appropriate section of the URL), the error is also displayed, e.g. see
https://monitor.qa.suse.de/alerting/grafana/b32d89067464d09c7395f0ec7badb560a9bd0a9e/view
Note that the graph **is not populated with data**.
These errors also appear in the oldest available logs (2025-05-12), so it is unclear when the issue originally started.
This also results in the logs growing by up to ~500 MB per day, although since we only keep logs for a week, the file size is currently not a big issue.
```
rrichardson@monitor:~> sudo ls -lh /var/log/grafana
total 1.9G
-rw-r----- 1 grafana grafana 99M May 20 09:15 grafana.log
-rw-r----- 1 grafana grafana 257M May 13 23:46 grafana.log.2025-05-13.002
-rw-r----- 1 grafana grafana 2.5M May 13 23:59 grafana.log.2025-05-14.001
-rw-r----- 1 grafana grafana 257M May 14 23:50 grafana.log.2025-05-14.002
-rw-r----- 1 grafana grafana 1.7M May 14 23:59 grafana.log.2025-05-15.001
-rw-r----- 1 grafana grafana 256M May 15 23:59 grafana.log.2025-05-16.001
-rw-r----- 1 grafana grafana 257M May 16 23:58 grafana.log.2025-05-16.002
-rw-r----- 1 grafana grafana 242K May 16 23:59 grafana.log.2025-05-17.001
-rw-r----- 1 grafana grafana 255M May 17 23:59 grafana.log.2025-05-18.001
-rw-r----- 1 grafana grafana 256M May 18 23:59 grafana.log.2025-05-19.001
-rw-r----- 1 grafana grafana 257M May 19 23:57 grafana.log.2025-05-19.002
-rw-r----- 1 grafana grafana 489K May 19 23:59 grafana.log.2025-05-20.001
```
## Acceptance criteria
* **AC1:** No obvious error messages in /var/log/grafana/grafana.log about "unmarshal number"
* **AC2:** We still have all (or reasonably obvious) alerts
## Suggestions
* Continue with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1397
* Ideally, try deploying our deployment files on a local Grafana instance and confirm that no errors appear. If you really must, try it out in production on monitor.qe.nue2.suse.org and monitor the logs
* Ensure no such errors in grafana logs on monitor.qe.nue2.suse.org
* Be smart applying the same rules on multiple files with multiple lines, e.g. `sed` replacements or fancy vim commands
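As an alternative to `sed` or vim, the same replacement can be scripted. The error message suggests that `evaluator.params` holds a bare number where Grafana expects a list of numbers. The following is a minimal sketch, assuming a rule definition has already been loaded into nested dicts and lists (loading and saving the files is left out). The function name `wrap_scalar_params` is hypothetical, and whether wrapping the number in a list is the right fix for every rule is an assumption that should be checked against the linked merge request.

```python
def wrap_scalar_params(node):
    """Recursively wrap bare numeric evaluator params in a list.

    Modifies the structure in place and returns the number of fixes made.
    """
    fixed = 0
    if isinstance(node, dict):
        evaluator = node.get("evaluator")
        if isinstance(evaluator, dict):
            params = evaluator.get("params")
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(params, (int, float)) and not isinstance(params, bool):
                evaluator["params"] = [params]
                fixed += 1
        for value in node.values():
            fixed += wrap_scalar_params(value)
    elif isinstance(node, list):
        for item in node:
            fixed += wrap_scalar_params(item)
    return fixed
```

Params that are already lists are left untouched, so running it a second time over the same data makes no further changes.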