action #182732
open
JSON unmarshal errors for multiple grafana alerts during provisioning size:M
Description
Observation
While working on #181703 I noticed that the grafana logs are being spammed with JSON unmarshal errors for multiple rule_uids.
openqa-monitor.qa.suse.de:/var/log/grafana/grafana.log, excerpt (more rule_uids are affected):
...
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965700132Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965831101Z level=error msg="Failed to evaluate rule" attempt=2 error="server side expressions pipeline returned an error: failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 version=9579 fingerprint=c5a55fb56f1bf301 now=2025-05-19T12:00:10Z rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 t=2025-05-19T12:00:17.493038787Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 version=9561 fingerprint=7f79f58aa93b4598 now=2025-05-19T12:00:10Z rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 t=2025-05-19T12:00:17.902913628Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
When opening an affected rule in the webui (by copying an affected rule_uid and pasting it into the appropriate part of the URL), the error is also displayed, e.g. see
https://monitor.qa.suse.de/alerting/grafana/b32d89067464d09c7395f0ec7badb560a9bd0a9e/view
Note that the graph is not populated with data.
These errors also appear in the oldest available logs (2025-05-12), so it is unclear when the issue originally started.
The errors also cause the logs to grow by ~500 MB per day, though as we only keep logs for a week the file size is currently not a big issue.
rrichardson@monitor:~> sudo ls -lh /var/log/grafana
total 1.9G
-rw-r----- 1 grafana grafana 99M May 20 09:15 grafana.log
-rw-r----- 1 grafana grafana 257M May 13 23:46 grafana.log.2025-05-13.002
-rw-r----- 1 grafana grafana 2.5M May 13 23:59 grafana.log.2025-05-14.001
-rw-r----- 1 grafana grafana 257M May 14 23:50 grafana.log.2025-05-14.002
-rw-r----- 1 grafana grafana 1.7M May 14 23:59 grafana.log.2025-05-15.001
-rw-r----- 1 grafana grafana 256M May 15 23:59 grafana.log.2025-05-16.001
-rw-r----- 1 grafana grafana 257M May 16 23:58 grafana.log.2025-05-16.002
-rw-r----- 1 grafana grafana 242K May 16 23:59 grafana.log.2025-05-17.001
-rw-r----- 1 grafana grafana 255M May 17 23:59 grafana.log.2025-05-18.001
-rw-r----- 1 grafana grafana 256M May 18 23:59 grafana.log.2025-05-19.001
-rw-r----- 1 grafana grafana 257M May 19 23:57 grafana.log.2025-05-19.002
-rw-r----- 1 grafana grafana 489K May 19 23:59 grafana.log.2025-05-20.001
Acceptance criteria
- AC1: No obvious error messages in /var/log/grafana/grafana.log about "unmarshal number"
- AC2: We still have all (or at least the reasonably obvious) alerts
Suggestions
- Continue with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1397
- Ideally, verify that our deployment files can be deployed to a local grafana instance without errors. Only if you really must, try it out in production on monitor.qe.nue2.suse.org and monitor the logs
- Ensure no such errors in grafana logs on monitor.qe.nue2.suse.org
- Be smart about applying the same rules to multiple files with multiple lines, e.g. sed replacements or fancy vim commands
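The suggested sed approach could be sketched as follows. This is a hypothetical fix: it assumes the broken provisioning YAML contains a scalar `params: <n>` where Grafana's ConditionEvalJSON expects an array (`[]float64`), which would match the "cannot unmarshal number" error above. It is demonstrated on a temporary file; verify against the actual files before touching anything in /etc/grafana.

```shell
# Hypothetical bulk fix, demonstrated on a temporary file. Assumption: the
# broken rules contain a scalar "params: <n>" where an array is expected.
tmpdir=$(mktemp -d)
printf 'evaluator:\n  type: gt\n  params: 95\n' > "$tmpdir/rule.yaml"

# Wrap the scalar in brackets so it unmarshals into []float64.
sed -i -E 's/(params:) ([0-9.]+)$/\1 [\2]/' "$tmpdir/rule.yaml"

cat "$tmpdir/rule.yaml"
# evaluator:
#   type: gt
#   params: [95]
```

On the real files this would be applied to /etc/grafana/provisioning/alerting/*.yaml, after taking a backup and checking the diff.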
Updated by robert.richardson 9 days ago · Edited
- Priority changed from Normal to High
- Target version changed from Tools - Next to Ready
I extracted the individual rule_uids from the grafana log which are failing repeatedly. There are currently 72 alerts in total which are not provisioned correctly and do not work.
List of failing alerts:
rule_uid | link |
---|---|
533ad0ef26ed4b043ed73100fd683ec3e34102d9 | Link |
4af054f6e5960637ddbbfe9dd9a1ff6fbd94a47d | Link |
4a0b3e8d42f47b0f9438e136afc1899ed1c14d4e | Link |
19ac80a657e42edc35159325821e6becb1a53c3b | Link |
184e7fc8efa6a4d40897ca440ccabf33a7b29b85 | Link |
2da9e7697c832f98835687d9ba039f696895b80b | Link |
29561dbaa8f2a4eedd8f9256f31c1f1c4b5729ae | Link |
9fe45ba31dd0beee3cb386f33428d6431f447f1d | Link |
9eedc5996057b67e56687b138210b404981b52ce | Link |
889eded63bc11413a1240bea2346625e4cbcf079 | Link |
70249fe77973c0e997769e50bcb675319bcd4595 | Link |
5feff69be6c75b288a9f88e686be6b7b4326fbdf | Link |
52c8ac74e4df57a3f9a188ed8f2916111ffd9431 | Link |
5263082f0ed1ae52276b7b63a0e2c67211def2b7 | Link |
4f52dd000f2cfc1c3ad0a89aa37d470810442617 | Link |
4ae7fe1f987dadaf16ea54137df39b93569a97db | Link |
2e1f661e468fb32834c4046d0f337744ee8f1a72 | Link |
2d1728402a0baf7f613fb14f0d67ce49868b8b7f | Link |
2aa374ac0c358871b943321f3505451144271256 | Link |
1cf3d38b3963ea1973daa9148a5142093e715693 | Link |
040e04ecbd9543cd5a48bd4720734c52fb1be44d | Link |
848330aaa5717df2c13ec634e34c7d5f74298f3a | Link |
ffd7c08874d71fe035a134d37a0a86f9d6a72865 | Link |
fdb46a6c77d0bbb6905804ba0d70de9d2657d5e6 | Link |
f751e32ec385d5e8aaf3e65ef6bec68ace1d07af | Link |
f23b0c0ddecc3ebbcf41e1d45e15d9c0fb61ec5b | Link |
eb17d9f0aae7442594544cb18693f0c8289620a6 | Link |
de4e513999b0b67eaa549ebfb7adb270d1735cf9 | Link |
cce8dd2a7002373d9e05cb967f9c5e15228f643f | Link |
c88a96aa41cd59bcbd46e6d01e9982ff89d83f20 | Link |
c34ac80ef9d436e929faa0e9a1ae54808e35978b | Link |
b839aae523468a1ee20dc8bfbaead92510a25bdc | Link |
b71ae0847e55f3cbfdf6ee25437e86e00bfcfe7b | Link |
b32d89067464d09c7395f0ec7badb560a9bd0a9e | Link |
ae1c581825cfb8924283b017b16bdacd2e4cb160 | Link |
a25c1bba45a95f100e4512c2598e6b8484a36510 | Link |
9d7ffd3f076a4b2d1c859c4ba8ebb867606b31a9 | Link |
8ccf71cb74435e55b8c0c45314c4727cb565e3cd | Link |
85391009299fa2ffe8d3447960ee535b07877a72 | Link |
69a278f04f65289630c36ea9a416943c912b2cbf | Link |
68a0d49b8469d3933a73dc0eb57a2c4d3f1b6789 | Link |
5e3999edd4bb63f2e541c1e870e23d6b7cdb8a88 | Link |
5bc3eefa1df9272c04416eb16a17bd3f78be02e9 | Link |
52503b5b861e6b560ec5fe06ea537cb73fc9e3df | Link |
4b9baf5f9741f637d3065535d100cf08e5d2b4b1 | Link |
3c21431c57914f4fd78510256415caea9a80ebe3 | Link |
3ae183d0c87a1035b947331ad5956de72402a49b | Link |
29821ed822226dcea3440ec569c6f1e24394435d | Link |
1842699f1f77ebafb7a504e7165df51b5c3f17e8 | Link |
17d9284c19fddeace8f2ea783bf56a5a8e051ea2 | Link |
17a9cac2b7e0852f4f690522f5d4f2b8c48d2f37 | Link |
1079b8f2c707e624c50f2087a43ccd68528f96eb | Link |
a7707d38e1fa0b395cef2ec3d8320574b15f0675 | Link |
ff59655ff58b6ee8f7bb201800cbbc14ecf220e3 | Link |
fe4f1bf1e133a9780d9635e1bcaf7002c5d90a72 | Link |
f9e4b815b5c1f3b862996ebf62a049778c74ad33 | Link |
f2f03ad11bd031f195894a3375da3dfc639e1d40 | Link |
d0fb109b71e712c2fe52bb640bb6a033ffd3367f | Link |
ce6d6cbee3fd622bdb1666c77311ae1c7d67566b | Link |
c0af9e9b91ecef0ef7867dd13d2d867645be17fa | Link |
bf66323692d78f27bc22187924ff664c23372f81 | Link |
b9c5aac0ff19889d4cb3bfd5513db6b59d1dac6e | Link |
b3260574d204e7f150c3dcb20ee3b99a82fd00b1 | Link |
6252c515252ed10db1dd8815f97156924e8c3c80 | Link |
5b1cb6f7c89bf81f544019a493ecda3c24534e72 | Link |
536930c353a070402e4c6482eec715c8697ebf3d | Link |
4f8ada285c6f96644c7985beafd88aafecbf14cd | Link |
4f320a0db4c2c23420c3cc53ab3a702172f95b0f | Link |
4a925d91d98162cabc6e189f3b60a0839a99a28e | Link |
3c53fe61de0b96212df8e37778e30b1a64f52ded | Link |
283dca227b148ba98bc16e72cc90b4f12c90a4e9 | Link |
fc1ee17af1901fb7587ee9d2bed5a1e5d7d1e473 | Link |
As the provisioning errors appear thousands of times per day, they also make it much harder to analyze the grafana log with regard to other issues.
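The extraction of failing rule_uids mentioned above can be done with a one-liner like the following (a sketch, assuming the log format shown in the description; demonstrated here on an embedded sample line, on the host it would run against /var/log/grafana/grafana.log):

```shell
# List the distinct rule_uids that hit the unmarshal error. The sample line
# mimics the production log format quoted in the ticket description.
sample='logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 level=error msg="Failed to build rule evaluator" error="json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"'
printf '%s\n' "$sample" \
  | grep 'cannot unmarshal number' \
  | grep -oE 'rule_uid=[0-9a-f]{40}' \
  | sort -u \
  | cut -d= -f2
# -> ce6d6cbee3fd622bdb1666c77311ae1c7d67566b
```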
Updated by robert.richardson 8 days ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 8 days ago
- Due date set to 2025-06-06
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson 8 days ago
For completeness: yesterday I tried to manually reproduce the errors using the commands mentioned by liv in the previous comment, and then to provision the alerts to a local grafana container using
podman run -d --replace \
--name grafana-test \
-p 3000:3000 \
-v /tmp/alerting:/etc/grafana/provisioning/alerting \
grafana/grafana:11.5.4
I then checked the logs with podman logs grafana-test.
Sadly this is not yet sufficient, as the resulting errors are about the missing data source rather than the unmarshal errors found on production. So I'll now try to mount an InfluxDB container alongside to get a usable reproducer.
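One way around the missing data source might be to provision a dummy InfluxDB data source, so Grafana can resolve the reference and rule parsing gets far enough to hit the unmarshal errors. This is a sketch under assumptions: the data source name (and possibly uid) below are placeholders and would have to match whatever the alert rules actually reference.

```shell
# Hypothetical dummy data source for the local reproducer. "influxdb" as the
# name and the localhost URL are assumptions, not values from production.
mkdir -p /tmp/datasources
cat > /tmp/datasources/influxdb.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: influxdb
    type: influxdb
    access: proxy
    url: http://localhost:8086
EOF
# Then add this mount to the podman invocation above:
#   -v /tmp/datasources:/etc/grafana/provisioning/datasources
```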
Updated by robert.richardson 8 days ago · Edited
Ok, so trying to replicate production is taking me too long. I was able to create a docker-compose.yaml that has a provisioned InfluxDB show up in the data sources list alongside the list of alerts in the grafana webui, though the alerts still do not recognize/use it and throw the same "data source not found" errors as before. I then tried to mount a copy of the entire provisioning folder from production, instead of only the provisioning/alerting subfolder, though this also needs further manual adjustments, as it seems the dashboards are provided by salt:
logger=provisioning.dashboard type=file name=var/lib t=2025-05-23T12:43:17.102049797Z level=error msg="Failed to read content of symlinked path" path=/var/lib/grafana/dashboards error="lstat /var/lib/grafana/dashboards: no such file or directory"
Also, I noticed that the InfluxDB on production was actually added manually instead of being provisioned, as monitor:/etc/grafana/provisioning/datasources is entirely empty.
To not waste more time I will resort to fixing the files within openqa-monitor.qa.suse.de:/etc/grafana/provisioning/alerting directly. If anyone has tips on how to properly reproduce the provisioning process locally, please let me know.
Updated by livdywan 4 days ago
Monitor seemingly unreachable? Related?
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4370736
monitor.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20250527103741269266
Updated by livdywan 3 days ago
- Related to action #181703: "QE Tools backlog - stacked" panel does not show data since 2025-04-08T12:00:00Z size:S added