action #182732

Updated by okurz 10 days ago

## Observation 
 While working on #181703 I noticed that the Grafana logs are being spammed with `JSON unmarshal errors` regarding multiple rule_uids. 

 *openqa-monitor.qa.suse.de:/var/log/grafana/grafana.log*, **excerpt (more rule_uids are affected)**: 
 ``` 
 ... 
 logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965700132Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64" 
 logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965831101Z level=error msg="Failed to evaluate rule" attempt=2 error="server side expressions pipeline returned an error: failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64" 
 logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 version=9579 fingerprint=c5a55fb56f1bf301 now=2025-05-19T12:00:10Z rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 t=2025-05-19T12:00:17.493038787Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64" 
 logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 version=9561 fingerprint=7f79f58aa93b4598 now=2025-05-19T12:00:10Z rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 t=2025-05-19T12:00:17.902913628Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64" 
 ``` 
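
 To get an overview of which rules are affected, the failing rule_uids could be tallied from the log. This is a minimal sketch, assuming the log line format shown in the excerpt above; the function name `count_unmarshal_errors` is hypothetical and not part of any existing tooling:

```python
import re
from collections import Counter

# Matches the first rule_uid on a log line (hex string, as in the excerpt).
RULE_UID = re.compile(r"rule_uid=([0-9a-f]+)")


def count_unmarshal_errors(lines):
    """Count log lines per rule_uid that mention the unmarshal failure."""
    counts = Counter()
    for line in lines:
        # Only consider lines carrying the error text from the excerpt.
        if "cannot unmarshal number" not in line:
            continue
        match = RULE_UID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

 The function accepts any iterable of lines, so an open file handle for `/var/log/grafana/grafana.log` can be passed directly; reading that file most likely requires elevated permissions.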

 When opening an affected rule in the web UI (by copying an affected rule_uid and pasting it into the appropriate section of the URL), the error is also displayed, e.g. see 
 https://monitor.qa.suse.de/alerting/grafana/b32d89067464d09c7395f0ec7badb560a9bd0a9e/view 

 Note that the graph **is not populated with data**. 

 These errors also appear in the oldest available logs (2025-05-12), so it is unclear when the issue originally started. 

 This also results in the logs growing by up to ~500 MB per day, although since we only keep logs for a week the file size is currently not a big issue. 

 ``` 
 rrichardson@monitor:~> sudo ls -lh /var/log/grafana 
 total 1.9G 
 -rw-r----- 1 grafana grafana    99M May 20 09:15 grafana.log 
 -rw-r----- 1 grafana grafana 257M May 13 23:46 grafana.log.2025-05-13.002 
 -rw-r----- 1 grafana grafana 2.5M May 13 23:59 grafana.log.2025-05-14.001 
 -rw-r----- 1 grafana grafana 257M May 14 23:50 grafana.log.2025-05-14.002 
 -rw-r----- 1 grafana grafana 1.7M May 14 23:59 grafana.log.2025-05-15.001 
 -rw-r----- 1 grafana grafana 256M May 15 23:59 grafana.log.2025-05-16.001 
 -rw-r----- 1 grafana grafana 257M May 16 23:58 grafana.log.2025-05-16.002 
 -rw-r----- 1 grafana grafana 242K May 16 23:59 grafana.log.2025-05-17.001 
 -rw-r----- 1 grafana grafana 255M May 17 23:59 grafana.log.2025-05-18.001 
 -rw-r----- 1 grafana grafana 256M May 18 23:59 grafana.log.2025-05-19.001 
 -rw-r----- 1 grafana grafana 257M May 19 23:57 grafana.log.2025-05-19.002 
 -rw-r----- 1 grafana grafana 489K May 19 23:59 grafana.log.2025-05-20.001 
 ``` 

 ## Acceptance criteria 
 * **AC1:** No obvious error messages in /var/log/grafana/grafana.log about "unmarshal number" 
 * **AC2:** We still have all (or reasonably obvious) alerts 

 ## Suggestions 
 * Continue with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1397 
 * Ideally, try deploying our deployment files on a local Grafana instance and verify that there are no errors. If you really must, try it out in production on monitor.qe.nue2.suse.org and monitor the logs 
 * Ensure no such errors in grafana logs on monitor.qe.nue2.suse.org 
 * Be smart about applying the same rules across multiple files with multiple lines, e.g. using `sed` replacements or fancy vim commands 
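
 The error message says that `evaluator.params` holds a plain number where Grafana expects a list of numbers. As an alternative to manual edits, the same fix could be applied to many rule definitions with a small script. This is a minimal sketch, assuming the rule definitions can be loaded as nested dicts and lists (e.g. from JSON) and that wrapping the scalar in a one-element list is the intended fix; `wrap_scalar_params` is a hypothetical helper, not part of Grafana or of salt-states-openqa:

```python
import json


def wrap_scalar_params(node):
    """Recursively wrap numeric `evaluator.params` values in a list, in place.

    Returns the number of values that were changed.
    """
    changed = 0
    if isinstance(node, dict):
        evaluator = node.get("evaluator")
        if isinstance(evaluator, dict):
            params = evaluator.get("params")
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(params, (int, float)) and not isinstance(params, bool):
                evaluator["params"] = [params]
                changed += 1
        for value in node.values():
            changed += wrap_scalar_params(value)
    elif isinstance(node, list):
        for item in node:
            changed += wrap_scalar_params(item)
    return changed


# Illustrative data only, not taken from the real rule files.
rule = json.loads('{"conditions": [{"evaluator": {"params": 5, "type": "gt"}}]}')
print(wrap_scalar_params(rule), json.dumps(rule))
```

 The real deployment files may use a different format, in which case a loader for that format would replace `json` here. Trying the result on a local Grafana instance first remains the safer check.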



