action #160284

closed

grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M

Added by okurz 7 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
2024-05-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://monitor.qa.suse.de/ yields

502 Bad Gateway

From

journalctl -u grafana-server
May 13 12:04:54 monitor grafana[28845]: cannot create rule with UID 'qa_network_infra_ping_time_alert_s390zl12': UID is longer than 40 symbols
…
May 13 12:05:31 monitor grafana[29160]: cannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols

The alerts are defined in monitor:/etc/grafana/provisioning/alerting/dashboard-WDs390zl12.yaml . I temporarily changed that string locally from "too_many_minion_job_failures_alert_s390zl12" to "too_many_minion_job_failures_s390zl12", and the other one accordingly. So apparently only those two strings are problematic?
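For reference, a quick shell check of the two offending UIDs against grafana's 40-character limit (just an illustration):

for uid in too_many_minion_job_failures_alert_s390zl12 qa_network_infra_ping_time_alert_s390zl12; do
  printf '%d %s\n' "${#uid}" "$uid"  # prints 43 and 41, both above the 40-character limit
done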

Acceptance criteria

  • AC1: grafana starts up consistently again
  • AC2: static code checks prevent us from running into the same problem before merging MRs

Suggestions

  • DONE Fix the problem transiently
  • DONE Research upstream for the problem. Maybe a new automatic grafana version upgrade triggered this? -> The feature change happened in https://github.com/grafana/grafana/commit/99fd7b8141e9cec296b810760ec0e86136ebfca0 (2023-09), so some time afterwards we got the new version including this, but we haven't added problematically long alerts since then.
  • Understand why only the two strings mentioned in the observation pose a problem
  • Fix the problem in salt-states-openqa for all UIDs
  • Add a CI check for UID length (a minimal sketch follows below)
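A minimal sketch of such a check, assuming it runs over the rendered provisioning files (the path and exit-code handling are assumptions):

grep -rhoP '^\s*uid:\s*\K\S+' /etc/grafana/provisioning/alerting/ \
  | awk 'length($0) > 40 { print "UID too long (" length($0) "): " $0; bad=1 } END { exit bad }'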
Actions #1

Updated by okurz 7 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by tinita 7 months ago

What's weird is that we have more existing UIDs with more than 40 characters, but grafana doesn't complain anymore:

% grep -r " uid:" /etc/grafana/ | perl -nlwE'if (m/uid: (\S{41})/) { say $_ }'
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: qa_network_infra_ping_time_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: too_many_minion_job_failures_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: qa_network_infra_ping_time_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: too_many_minion_job_failures_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: qa_network_infra_ping_time_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: too_many_minion_job_failures_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: qa_network_infra_ping_time_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: too_many_minion_job_failures_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: openqa_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: partitions_usage_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: too_many_minion_job_failures_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: qa_network_infra_ping_time_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: too_many_minion_job_failures_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: too_many_minion_job_failures_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: too_many_minion_job_failures_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: qa_network_infra_ping_time_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: too_many_minion_job_failures_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: qa_network_infra_ping_time_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: too_many_minion_job_failures_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: too_many_minion_job_failures_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: qa_network_infra_ping_time_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: too_many_minion_job_failures_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDpetrol.yaml:    uid: too_many_minion_job_failures_alert_petrol
/etc/grafana/provisioning/alerting/dashboard-WDdiesel.yaml:    uid: too_many_minion_job_failures_alert_diesel
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: qa_network_infra_ping_time_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: too_many_minion_job_failures_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: too_many_minion_job_failures_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: qa_network_infra_ping_time_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: too_many_minion_job_failures_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: qa_network_infra_ping_time_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: too_many_minion_job_failures_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: too_many_minion_job_failures_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: qa_network_infra_ping_time_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: too_many_minion_job_failures_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: qa_network_infra_ping_time_alert_worker29
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: too_many_minion_job_failures_alert_worker29
Actions #3

Updated by okurz 7 months ago

  • Description updated (diff)
  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

The problem was only triggered on s390zl12 because nicksinger added the role "worker" to the host s390zl12 for #159066, causing grafana to try to recreate those alerts; all other alerts still exist in the database and grafana only complains about new ones. By now I have also reverted the addition of the worker role to s390zl12, so two levels of workaround are now applied. We still need to continue with the other steps to prevent problems with new alerts in the future.

Actions #4

Updated by okurz 7 months ago

  • Assignee set to nicksinger

As decided with nicksinger in a Jitsi call.

Actions #5

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1185 further reduces the length where necessary, e.g. for openqaworker-arm-1, a rather long hostname.

Actions #6

Updated by nicksinger 7 months ago

Removing the old alerts with the old UIDs from the DB was necessary to avoid duplication because the titles of the alerts stay the same. We ran the following manually on monitor:

monitor:/etc/grafana/provisioning/alerting # RULE_UID=qa_network_infra_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
monitor:/etc/grafana/provisioning/alerting # RULE_UID=too_many_minion_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
Actions #7

Updated by livdywan 7 months ago

  • Subject changed from grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" to grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M
  • Status changed from New to Workable

We discussed the ticket and estimated it without changing the description.

Actions #9

Updated by okurz 7 months ago

We tried this out with a temporary alert, and the export looks like this:

apiVersion: 1
groups:
    - orgId: 1
      name: test-quick
      folder: WIP
      interval: 10s
      rules:
        - uid: ab735516-b49e-4ce8-bee8-b2ebbdd1c6f5
          title: Test annotation
          condition: B
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                alias: $tag_url
                datasource:
                    type: influxdb
                    uid: "000000001"
                intervalMs: 1000
                maxDataPoints: 43200
                query: SELECT mean("average_response_ms") FROM "ping" WHERE ("host" = 'openqa') AND $timeFilter GROUP BY time($__interval), "url" fill(null)
                rawQuery: true
                refId: A
                resultFormat: time_series
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 5
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - A
                      reducer:
                        params: []
                        type: max
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: ""
                intervalMs: 1000
                maxDataPoints: 43200
                refId: B
                type: classic_conditions
          noDataState: NoData
          execErrState: Error
          for: 10s
          annotations:
            summary: |-
                The following machines were not pingable for several minutes:
                {{ range $k, $v := $values -}}
                {{ if (match "B[0-9]+" $k) -}}
                * {{ $v.Labels }}{{ end }}
                {{ end }}

                Suggested actions:
                * foo
                * bar
          labels:
            __contacts__: Private message to nobody
          isPaused: true
        - uid: aed2400b-df4e-4374-b338-79f780436d68
          title: test for uid generation with much longer string length so to see if it abbreviates or hashes or something
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                intervalMs: 1000
                maxDataPoints: 43200
                refId: A
            - refId: B
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: B
                type: reduce
            - refId: C
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations:
            description: ""
            runbook_url: ""
            summary: ""
          labels:
            "": ""
          isPaused: false

There is no rule_uid in there, just "uid" with what looks like an automatically generated UUID. Maybe we can just deploy with that key removed.
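If we went that route, dropping the manual uid keys from a provisioned file could look roughly like this (a sketch with mikefarah yq v4, not what was eventually done; the file name is just an example):

yq -i 'del(.groups[].rules[].uid)' dashboard-WDworker40.yaml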

Actions #10

Updated by okurz 7 months ago

  • Priority changed from High to Normal
Actions #11

Updated by okurz 7 months ago

  • Status changed from Workable to In Progress
Actions #12

Updated by openqa_review 7 months ago

  • Due date set to 2024-06-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by nicksinger 6 months ago

  • Status changed from In Progress to Feedback

Checked our files again and realized that only the generic and the worker dashboards contain manual UIDs written by us. I used :%s/\(\s\+\)uid:\s\+\(.*\){{\s*\(.*\)\s*}}$/\1uid: {{ (('\2' + \3) | sha512[:40] }}/g in vim to generate: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211
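The underlying idea in plain shell terms: derive a fixed-length UID by hashing the full descriptive name and truncating the digest to 40 characters (the example name is just for illustration):

printf '%s' 'too_many_minion_job_failures_alert_openqaworker-arm-1' | sha512sum | cut -c1-40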

Actions #14

Updated by okurz 6 months ago

  • Due date changed from 2024-06-20 to 2024-07-04
  • Status changed from Feedback to Workable

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211 mentions deleting older UIDs, so back to "Workable"

Actions #15

Updated by livdywan 6 months ago

  • Due date deleted (2024-07-04)
Actions #16

Updated by nicksinger 5 months ago

  • Status changed from Workable to Feedback

Creating deletion rules according to the docs ( https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning/#import-alert-rules ) with:

(echo 'apiVersion: 1'; echo 'deleteRules:'; ssh root@monitor.qe.nue2.suse.org '(grep -ri " uid:" /etc/grafana/provisioning/alerting/dashboard-GD* | cut -d ":" -f 3-; grep -ri "^\s\{4\}uid:" /etc/grafana/provisioning/alerting/dashboard-WD* | cut -d: -f 3-)' | xargs -I{} bash -c "echo -e '  - orgId: 1';echo '    uid: {}'") | yq

and added them to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211
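For reference, the generated deletion file should look roughly like this (per the grafana file-provisioning docs; the uid values are just examples taken from the existing dashboards):

apiVersion: 1
deleteRules:
  - orgId: 1
    uid: qa_network_infra_ping_time_alert_worker40
  - orgId: 1
    uid: too_many_minion_job_failures_alert_worker40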

Actions #17

Updated by nicksinger 5 months ago

Unfortunately I noticed that the deployment (silently) failed. First I had to fix the salt-minion on monitor, which was not properly responding to master requests, and then had to follow up on a syntax error introduced by me: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1223

Deploying the new worker and generic alerts then worked, but grafana took a very long time to restart, resulting in a systemd timeout and an endless restart loop (I guess it could never complete the whole deletion and recreation routine in time). Manually starting grafana-server as user grafana allowed me to make further progress and revealed duplicate entries, but no further details in the log.
Moving all "WD" and "GD" files out of the way made grafana start up, so the cleanup from https://progress.opensuse.org/issues/160284#note-16 was not sufficient. In that state I was able to export a list of all currently deployed alerts from the grafana web UI. I then generated a list of the currently disabled alerts on monitor with cat *WD* | grep "^ name:".

Combining these lists gave me a list of leftover old alerts which conflict with the new alerts that use the hash as UID:

while read p; do cat alert-rules-1720458432594.yaml | yq -r '(.groups[] | select(.name == "'"$p"'")).rules[].uid'; done < <(cat should_be_gone.txt | cut -d \' -f 2)
system_load_alert_grenache-1
system_load_alert_imagetester
system_load_alert_mania
system_load_alert_openqaworker14
system_load_alert_openqaworker16
system_load_alert_openqaworker17
system_load_alert_openqaworker18
system_load_alert_openqaworker1
system_load_alert_openqaworker-arm-1
system_load_alert_petrol
system_load_alert_qesapworker-prg4
system_load_alert_qesapworker-prg5
system_load_alert_qesapworker-prg6
system_load_alert_qesapworker-prg7
system_load_alert_sapworker1
system_load_alert_sapworker2
system_load_alert_sapworker3
system_load_alert_worker29
system_load_alert_worker30
system_load_alert_worker31
system_load_alert_worker32
system_load_alert_worker33
system_load_alert_worker34
system_load_alert_worker35
system_load_alert_worker40
system_load_alert_worker-arm1
system_load_alert_worker-arm2

After deleting these alerts and moving the new alert definitions back, I restarted grafana once more as well as salt-minion to apply a final highstate, with the migration partly done manually. This succeeded.
Final cleanup is done with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1224
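Roughly the final steps in shell terms (the exact invocations are an assumption):

sudo systemctl restart grafana-server salt-minion   # on monitor
sudo salt 'monitor*' state.apply                     # on the salt master, to apply the final highstate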

Actions #18

Updated by nicksinger 5 months ago

  • Status changed from Feedback to Resolved