action #182732

JSON unmarshal errors for multiple grafana alerts during provisioning size:M

Added by robert.richardson 11 days ago. Updated 4 days ago.

Status: In Progress
Priority: High
Category: -
Target version:
Start date:
Due date: 2025-06-06 (Due in 6 days)
% Done: 0%
Estimated time:

Description

Observation

While working on #181703 I noticed that the Grafana logs are being spammed with JSON unmarshal errors regarding multiple rule_uids.

openqa-monitor.qa.suse.de:/var/log/grafana/grafana.log, excerpt (more rule_uids are affected):

...
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965700132Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 version=9565 fingerprint=a57e9c58c050ca4b now=2025-05-19T12:00:10Z rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 t=2025-05-19T12:00:16.965831101Z level=error msg="Failed to evaluate rule" attempt=2 error="server side expressions pipeline returned an error: failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 version=9579 fingerprint=c5a55fb56f1bf301 now=2025-05-19T12:00:10Z rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 t=2025-05-19T12:00:17.493038787Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 version=9561 fingerprint=7f79f58aa93b4598 now=2025-05-19T12:00:10Z rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 t=2025-05-19T12:00:17.902913628Z level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"

When opening an affected rule in the web UI (by copying an affected rule_uid and pasting it into the appropriate section of the URL), the error is also displayed, e.g. see
https://monitor.qa.suse.de/alerting/grafana/b32d89067464d09c7395f0ec7badb560a9bd0a9e/view

Note that the graph is not populated with data.

These errors also appear in the oldest available logs (2025-05-12), so it is unclear when the issue originally started.

This also results in the logs growing by up to ~500 MB per day, although since we only keep logs for a week the file size is currently not a big issue.

rrichardson@monitor:~> sudo ls -lh /var/log/grafana
total 1.9G
-rw-r----- 1 grafana grafana  99M May 20 09:15 grafana.log
-rw-r----- 1 grafana grafana 257M May 13 23:46 grafana.log.2025-05-13.002
-rw-r----- 1 grafana grafana 2.5M May 13 23:59 grafana.log.2025-05-14.001
-rw-r----- 1 grafana grafana 257M May 14 23:50 grafana.log.2025-05-14.002
-rw-r----- 1 grafana grafana 1.7M May 14 23:59 grafana.log.2025-05-15.001
-rw-r----- 1 grafana grafana 256M May 15 23:59 grafana.log.2025-05-16.001
-rw-r----- 1 grafana grafana 257M May 16 23:58 grafana.log.2025-05-16.002
-rw-r----- 1 grafana grafana 242K May 16 23:59 grafana.log.2025-05-17.001
-rw-r----- 1 grafana grafana 255M May 17 23:59 grafana.log.2025-05-18.001
-rw-r----- 1 grafana grafana 256M May 18 23:59 grafana.log.2025-05-19.001
-rw-r----- 1 grafana grafana 257M May 19 23:57 grafana.log.2025-05-19.002
-rw-r----- 1 grafana grafana 489K May 19 23:59 grafana.log.2025-05-20.001

Acceptance criteria

  • AC1: No obvious error messages in /var/log/grafana/grafana.log about "unmarshal number"
  • AC2: We still have all (or all reasonably obvious) alerts
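
A rough way to check AC1 could be to count the log lines that mention "unmarshal number". The sketch below is only an illustration: the function name no_unmarshal_errors is made up here, and it reads the log from stdin so it can be tried on a sample before pointing it at /var/log/grafana/grafana.log.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tooling): reads log text on
# stdin, prints how many lines mention "unmarshal number", and returns
# success only if there are none.
no_unmarshal_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    count=$(grep -c 'unmarshal number' || true)
    echo "lines with unmarshal errors: $count"
    [ "$count" -eq 0 ]
}

# Try it on a made-up sample line that contains no such error
if printf '%s\n' 'level=info msg="all good"' | no_unmarshal_errors; then
    echo "AC1 looks satisfied for this sample"
fi
```

On the monitoring host the same function could be fed with something like sudo cat /var/log/grafana/grafana.log | no_unmarshal_errors.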

Suggestions

  • Continue with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1397
  • Ideally, try deploying our deployment files on a local Grafana instance without errors. If you really must, try it out in production on monitor.qe.nue2.suse.org and monitor the logs
  • Ensure no such errors in grafana logs on monitor.qe.nue2.suse.org
  • Be smart applying the same rules on multiple files with multiple lines, e.g. sed replacements or fancy vim commands
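
The sed idea from the last suggestion could be sketched roughly as follows. This rests on an assumption that has not been verified against the real files: the error message says evaluator.params must be a list of numbers, so the sketch assumes the broken rules contain a bare number on a "params:" line in YAML. The function name wrap_scalar_params and the sample input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: wraps a bare numeric "params:" value in a YAML list,
# e.g. "params: 5" becomes "params: [5]".
# ASSUMPTION: the failing rules carry a scalar under evaluator.params; the
# actual provisioning files may look different (e.g. JSON instead of YAML).
wrap_scalar_params() {
    sed -E 's/^([[:space:]]*params:[[:space:]]*)(-?[0-9]+(\.[0-9]+)?)[[:space:]]*$/\1[\2]/'
}

# Made-up sample fragment, not copied from the real alert definitions
wrap_scalar_params <<'EOF'
evaluator:
  params: 5
  type: gt
EOF
```

Lines that already hold a list, or a "params:" key followed by list items on the next lines, are left untouched because the pattern only matches a number directly after the key. Running it on copies of the files and reviewing the diff before touching /etc/grafana/provisioning/alerting would be the cautious route.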

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #181703: "QE Tools backlog - stacked" panel does not show data since 2025-04-08T12:00:00Z size:S (Resolved, robert.richardson, 2025-05-02)

Actions #1

Updated by robert.richardson 11 days ago

  • Description updated (diff)
Actions #2

Updated by robert.richardson 9 days ago · Edited

  • Priority changed from Normal to High
  • Target version changed from Tools - Next to Ready

I extracted the individual rule_uids that are failing repeatedly from the Grafana log. There are currently 72 alerts in total which are not provisioned correctly and do not work.

List of failing alerts
rule_uid link
533ad0ef26ed4b043ed73100fd683ec3e34102d9 Link
4af054f6e5960637ddbbfe9dd9a1ff6fbd94a47d Link
4a0b3e8d42f47b0f9438e136afc1899ed1c14d4e Link
19ac80a657e42edc35159325821e6becb1a53c3b Link
184e7fc8efa6a4d40897ca440ccabf33a7b29b85 Link
2da9e7697c832f98835687d9ba039f696895b80b Link
29561dbaa8f2a4eedd8f9256f31c1f1c4b5729ae Link
9fe45ba31dd0beee3cb386f33428d6431f447f1d Link
9eedc5996057b67e56687b138210b404981b52ce Link
889eded63bc11413a1240bea2346625e4cbcf079 Link
70249fe77973c0e997769e50bcb675319bcd4595 Link
5feff69be6c75b288a9f88e686be6b7b4326fbdf Link
52c8ac74e4df57a3f9a188ed8f2916111ffd9431 Link
5263082f0ed1ae52276b7b63a0e2c67211def2b7 Link
4f52dd000f2cfc1c3ad0a89aa37d470810442617 Link
4ae7fe1f987dadaf16ea54137df39b93569a97db Link
2e1f661e468fb32834c4046d0f337744ee8f1a72 Link
2d1728402a0baf7f613fb14f0d67ce49868b8b7f Link
2aa374ac0c358871b943321f3505451144271256 Link
1cf3d38b3963ea1973daa9148a5142093e715693 Link
040e04ecbd9543cd5a48bd4720734c52fb1be44d Link
848330aaa5717df2c13ec634e34c7d5f74298f3a Link
ffd7c08874d71fe035a134d37a0a86f9d6a72865 Link
fdb46a6c77d0bbb6905804ba0d70de9d2657d5e6 Link
f751e32ec385d5e8aaf3e65ef6bec68ace1d07af Link
f23b0c0ddecc3ebbcf41e1d45e15d9c0fb61ec5b Link
eb17d9f0aae7442594544cb18693f0c8289620a6 Link
de4e513999b0b67eaa549ebfb7adb270d1735cf9 Link
cce8dd2a7002373d9e05cb967f9c5e15228f643f Link
c88a96aa41cd59bcbd46e6d01e9982ff89d83f20 Link
c34ac80ef9d436e929faa0e9a1ae54808e35978b Link
b839aae523468a1ee20dc8bfbaead92510a25bdc Link
b71ae0847e55f3cbfdf6ee25437e86e00bfcfe7b Link
b32d89067464d09c7395f0ec7badb560a9bd0a9e Link
ae1c581825cfb8924283b017b16bdacd2e4cb160 Link
a25c1bba45a95f100e4512c2598e6b8484a36510 Link
9d7ffd3f076a4b2d1c859c4ba8ebb867606b31a9 Link
8ccf71cb74435e55b8c0c45314c4727cb565e3cd Link
85391009299fa2ffe8d3447960ee535b07877a72 Link
69a278f04f65289630c36ea9a416943c912b2cbf Link
68a0d49b8469d3933a73dc0eb57a2c4d3f1b6789 Link
5e3999edd4bb63f2e541c1e870e23d6b7cdb8a88 Link
5bc3eefa1df9272c04416eb16a17bd3f78be02e9 Link
52503b5b861e6b560ec5fe06ea537cb73fc9e3df Link
4b9baf5f9741f637d3065535d100cf08e5d2b4b1 Link
3c21431c57914f4fd78510256415caea9a80ebe3 Link
3ae183d0c87a1035b947331ad5956de72402a49b Link
29821ed822226dcea3440ec569c6f1e24394435d Link
1842699f1f77ebafb7a504e7165df51b5c3f17e8 Link
17d9284c19fddeace8f2ea783bf56a5a8e051ea2 Link
17a9cac2b7e0852f4f690522f5d4f2b8c48d2f37 Link
1079b8f2c707e624c50f2087a43ccd68528f96eb Link
a7707d38e1fa0b395cef2ec3d8320574b15f0675 Link
ff59655ff58b6ee8f7bb201800cbbc14ecf220e3 Link
fe4f1bf1e133a9780d9635e1bcaf7002c5d90a72 Link
f9e4b815b5c1f3b862996ebf62a049778c74ad33 Link
f2f03ad11bd031f195894a3375da3dfc639e1d40 Link
d0fb109b71e712c2fe52bb640bb6a033ffd3367f Link
ce6d6cbee3fd622bdb1666c77311ae1c7d67566b Link
c0af9e9b91ecef0ef7867dd13d2d867645be17fa Link
bf66323692d78f27bc22187924ff664c23372f81 Link
b9c5aac0ff19889d4cb3bfd5513db6b59d1dac6e Link
b3260574d204e7f150c3dcb20ee3b99a82fd00b1 Link
6252c515252ed10db1dd8815f97156924e8c3c80 Link
5b1cb6f7c89bf81f544019a493ecda3c24534e72 Link
536930c353a070402e4c6482eec715c8697ebf3d Link
4f8ada285c6f96644c7985beafd88aafecbf14cd Link
4f320a0db4c2c23420c3cc53ab3a702172f95b0f Link
4a925d91d98162cabc6e189f3b60a0839a99a28e Link
3c53fe61de0b96212df8e37778e30b1a64f52ded Link
283dca227b148ba98bc16e72cc90b4f12c90a4e9 Link
fc1ee17af1901fb7587ee9d2bed5a1e5d7d1e473 Link

As the provisioning errors appear thousands of times per day, they also make it much harder to analyze the Grafana log for other issues.
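
For reference, an extraction like the one described above could be done roughly as follows. This is a sketch, not necessarily the commands that were actually used: the function name failing_rule_uids is made up, it assumes each log entry sits on a single line, and it assumes rule_uids are 40 hex characters as in the list above. The sample input is abbreviated and partly invented.

```shell
#!/bin/sh
# Hypothetical helper: reads Grafana log text on stdin and prints the
# distinct rule_uids of lines that contain the unmarshal error.
failing_rule_uids() {
    grep 'cannot unmarshal number' \
        | grep -oE 'rule_uid=[0-9a-f]{40}' \
        | cut -d= -f2 \
        | sort -u   # each entry logs its rule_uid twice, so deduplicate
}

# Abbreviated sample lines (not verbatim log output); the last one is an
# unrelated entry and should not be reported.
failing_rule_uids <<'EOF'
logger=ngalert.scheduler rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b org_id=1 rule_uid=ce6d6cbee3fd622bdb1666c77311ae1c7d67566b level=error error="json: cannot unmarshal number into Go struct field"
logger=ngalert.scheduler rule_uid=b32d89067464d09c7395f0ec7badb560a9bd0a9e org_id=1 level=error error="json: cannot unmarshal number into Go struct field"
logger=ngalert.scheduler rule_uid=c34ac80ef9d436e929faa0e9a1ae54808e35978b org_id=1 level=info msg="unrelated entry"
EOF
```

Piping the output through wc -l would give the number of affected alerts.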

Actions #3

Updated by okurz 9 days ago

  • Subject changed from JSON unmarshal errors for multiple grafana alerts during provisioning to JSON unmarshal errors for multiple grafana alerts during provisioning size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by robert.richardson 8 days ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #5

Updated by openqa_review 8 days ago

  • Due date set to 2025-06-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by livdywan 8 days ago

For reference, the commands used for testing:

ssh openqa-monitor.qa.suse.de 'mkdir -p tmp && sudo cp -r /etc/grafana/provisioning/alerting tmp/alerting && sudo chown $USER:root tmp/alerting'
scp -r openqa-monitor.qa.suse.de:tmp/alerting /tmp/alerting 
Actions #7

Updated by robert.richardson 8 days ago

For completeness: yesterday I tried to manually reproduce the errors by fetching the files with the commands liv mentioned in the previous comment and then provisioning the alerts to a local Grafana container using

podman run -d --replace \
  --name grafana-test \
  -p 3000:3000 \
  -v /tmp/alerting:/etc/grafana/provisioning/alerting \
  grafana/grafana:11.5.4

I then checked the logs with podman logs grafana-test.

Sadly this is not yet sufficient, as the resulting errors are about the missing data source rather than the unmarshal errors found on production. So I'll now try to run an InfluxDB container alongside to get a usable reproducer.

Actions #8

Updated by robert.richardson 8 days ago · Edited

OK, so trying to replicate production is taking me too long. I was able to create a docker-compose.yaml that makes a provisioned InfluxDB show up in the data sources list alongside the list of alerts in the Grafana web UI, though the alerts still do not recognize/use it and throw the same "data source not found" errors as before. I then tried to mount a copy of the entire provisioning folder from production instead of only the provisioning/alerting subfolder, though this also needs further manual adjustments, as it seems the dashboards are provided by salt:

logger=provisioning.dashboard type=file name=var/lib t=2025-05-23T12:43:17.102049797Z level=error msg="Failed to read content of symlinked path" path=/var/lib/grafana/dashboards error="lstat /var/lib/grafana/dashboards: no such file or directory"

I also noticed that the InfluxDB data source on production was actually added manually instead of being provisioned, as monitor:/etc/grafana/provisioning/datasources is entirely empty.

To not waste more time I will resort to fixing the files within openqa-monitor.qa.suse.de:/etc/grafana/provisioning/alerting directly. If anyone has tips on how to properly reproduce the provisioning process locally, please let me know.

Actions #9

Updated by livdywan 4 days ago

Monitor seemingly unreachable? Related?

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4370736

monitor.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
    
    salt-run jobs.lookup_jid 20250527103741269266
Actions #10

Updated by livdywan 3 days ago

  • Related to action #181703: "QE Tools backlog - stacked" panel does not show data since 2025-04-08T12:00:00Z size:S added