action #160284

open

grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M

Added by okurz about 1 month ago. Updated 10 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-05-13
Due date:
2024-06-20 (Due in 4 days)
% Done:

0%

Estimated time:

Description

Observation

https://monitor.qa.suse.de/ yields

502 Bad Gateway

From

journalctl -u grafana-server
May 13 12:04:54 monitor grafana[28845]: cannot create rule with UID 'qa_network_infra_ping_time_alert_s390zl12': UID is longer than 40 symbols
…
May 13 12:05:31 monitor grafana[29160]: cannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols

The alerts are defined in monitor:/etc/grafana/provisioning/alerting/dashboard-WDs390zl12.yaml. I temporarily changed that string locally from "too_many_minion_job_failures_alert_s390zl12" to "too_many_minion_job_failures_s390zl12" and shortened the other UID in the same way. So apparently only those two strings are problematic?

Acceptance criteria

  • AC1: grafana starts up consistently again
  • AC2: static code checks prevent us from running into the same problem before merging MRs

Suggestions

  • DONE Fix the problem transiently
  • DONE Research upstream for the problem. Maybe a new automatic grafana version upgrade triggered this? -> The feature change happened in https://github.com/grafana/grafana/commit/99fd7b8141e9cec296b810760ec0e86136ebfca0 in 2023-09, so some time afterwards we got the new version including this change but have not added problematically long alerts since then.
  • Understand why only the two strings mentioned in the observation pose a problem
  • Fix the problem in salt-states-openqa for all UIDs
  • Add a check for UID length that is called in CI
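
A minimal sketch of what such a check could look like, assuming it runs over rendered YAML files (not the Jinja templates in salt-states-openqa, where the final hostname is not yet known). The names find_long_uids and main are hypothetical, and the limit of 40 is taken from the grafana error message above. It matches "uid:" lines with a regular expression instead of parsing YAML, so it needs nothing beyond the standard library.

```python
import re
from pathlib import Path

MAX_UID_LENGTH = 40  # limit quoted in the grafana error message

# Matches lines such as "    uid: some_uid" or "  - uid: some_uid".
# Keys like "datasourceUid:" are not matched because the pattern is anchored.
UID_LINE = re.compile(r"^\s*(?:-\s+)?uid:\s*[\"']?([^\s\"']+)")


def find_long_uids(text, limit=MAX_UID_LENGTH):
    """Return (line_number, uid) pairs for every uid longer than limit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = UID_LINE.match(line)
        if match and len(match.group(1)) > limit:
            hits.append((number, match.group(1)))
    return hits


def main(paths):
    """Print every offending uid and return a non-zero exit code if any was found."""
    failed = False
    for path in paths:
        for number, uid in find_long_uids(Path(path).read_text()):
            print(f"{path}:{number}: uid '{uid}' has {len(uid)} characters "
                  f"(limit {MAX_UID_LENGTH})")
            failed = True
    return 1 if failed else 0
```

A CI job could pass the rendered provisioning files to main() and use the return value as its exit code. Whether the check should run on rendered files or on the templates plus the list of known hostnames is left open here.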
Actions #1

Updated by okurz about 1 month ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by tinita about 1 month ago

What's weird is that we have more existing UIDs with more than 40 characters, but grafana doesn't complain about them:

% grep -r " uid:" /etc/grafana/ | perl -nlwE'if (m/uid: (\S{41})/) { say $_ }'
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: qa_network_infra_ping_time_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: too_many_minion_job_failures_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: qa_network_infra_ping_time_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: too_many_minion_job_failures_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: qa_network_infra_ping_time_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: too_many_minion_job_failures_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: qa_network_infra_ping_time_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: too_many_minion_job_failures_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: openqa_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: partitions_usage_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: too_many_minion_job_failures_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: qa_network_infra_ping_time_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: too_many_minion_job_failures_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: too_many_minion_job_failures_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: too_many_minion_job_failures_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: qa_network_infra_ping_time_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: too_many_minion_job_failures_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: qa_network_infra_ping_time_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: too_many_minion_job_failures_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: too_many_minion_job_failures_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: qa_network_infra_ping_time_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: too_many_minion_job_failures_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDpetrol.yaml:    uid: too_many_minion_job_failures_alert_petrol
/etc/grafana/provisioning/alerting/dashboard-WDdiesel.yaml:    uid: too_many_minion_job_failures_alert_diesel
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: qa_network_infra_ping_time_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: too_many_minion_job_failures_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: too_many_minion_job_failures_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: qa_network_infra_ping_time_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: too_many_minion_job_failures_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: qa_network_infra_ping_time_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: too_many_minion_job_failures_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: too_many_minion_job_failures_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: qa_network_infra_ping_time_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: too_many_minion_job_failures_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: qa_network_infra_ping_time_alert_worker29
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: too_many_minion_job_failures_alert_worker29
Actions #3

Updated by okurz about 1 month ago

  • Description updated (diff)
  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

The problem has only been triggered on s390zl12 because nicksinger added the role "worker" to the host s390zl12 for #159066, which caused grafana to try to recreate those alerts. All other alerts still exist in the database, and grafana would only complain about new alerts. By now I have also reverted the addition of the worker role to s390zl12, so two levels of workaround are now applied. We still need to continue with the other steps to prevent problems with new alerts in the future.

Actions #4

Updated by okurz about 1 month ago

  • Assignee set to nicksinger

As decided with nicksinger in a Jitsi talk.

Actions #5

Updated by okurz about 1 month ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1185 further reduces the length, as was necessary e.g. for openqaworker-arm-1, a rather long hostname.
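
For illustration only, and not necessarily what the merge request does: one possible way to keep a UID within the limit even for long hostnames is to cut the string and append a short digest so that different hosts still get different UIDs. The function name make_alert_uid and the 8-character digest are assumptions made for this sketch.

```python
import hashlib

MAX_UID_LENGTH = 40  # limit quoted in the grafana error message


def make_alert_uid(prefix, hostname, limit=MAX_UID_LENGTH):
    """Build '<prefix>_<hostname>' and shorten it if it exceeds limit."""
    uid = f"{prefix}_{hostname}"
    if len(uid) <= limit:
        return uid
    # Replace the tail with a short digest of the full UID so that
    # different long hostnames do not collapse into the same UID.
    digest = hashlib.sha256(uid.encode()).hexdigest()[:8]
    keep = limit - len(digest) - 1  # room for "_" plus the digest
    return f"{uid[:keep]}_{digest}"
```

The downside is that the shortened UID is less readable than a shorter fixed prefix, because the hostname can be cut away completely. Simply choosing shorter prefixes keeps the UIDs readable as long as no hostname is too long.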

Actions #6

Updated by nicksinger about 1 month ago

Removing the old alerts with the old UID from the DB was necessary to avoid a duplication because the title of the alert is the same. We used the following manually on monitor:

monitor:/etc/grafana/provisioning/alerting # RULE_UID=qa_network_infra_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
monitor:/etc/grafana/provisioning/alerting # RULE_UID=too_many_minion_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
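
The same cleanup can be sketched in Python with the standard sqlite3 module. The table and column names are taken from the statements above; the function names and everything else are assumptions. One difference to the manual commands: in SQL LIKE patterns "_" matches any single character, so the sketch escapes "_" and "%" in the prefix to match it literally, and it runs all deletes in one transaction.

```python
import sqlite3

# (table, column, match_anywhere) as used in the manual statements above
TARGETS = [
    ("alert_rule", "uid", False),
    ("alert_rule_version", "rule_uid", False),
    ("alert_instance", "rule_uid", False),
    ("provenance_type", "record_key", False),
    ("annotation", "text", True),
]


def like_pattern(prefix, match_anywhere=False):
    """Escape LIKE wildcards in prefix, then add the trailing (and optional leading) %."""
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return ("%" if match_anywhere else "") + escaped + "%"


def delete_rules(connection, prefix):
    """Delete all rows belonging to rule UIDs that start with prefix.

    Returns the number of deleted rows per table.
    """
    deleted = {}
    with connection:  # commits on success, rolls back on an exception
        for table, column, anywhere in TARGETS:
            cursor = connection.execute(
                f"DELETE FROM {table} WHERE {column} LIKE ? ESCAPE '\\'",
                (like_pattern(prefix, anywhere),),
            )
            deleted[table] = cursor.rowcount
    return deleted
```

It could be used as delete_rules(sqlite3.connect("/var/lib/grafana/grafana.db"), "qa_network_infra_alert_"), run as the grafana user like the commands above.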
Actions #7

Updated by livdywan about 1 month ago

  • Subject changed from grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" to grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M
  • Status changed from New to Workable

We discussed the ticket, and estimated it without changing the description

Actions #9

Updated by okurz 30 days ago

We tried this out with a temporary alert, and the export looks like this:

apiVersion: 1
groups:
    - orgId: 1
      name: test-quick
      folder: WIP
      interval: 10s
      rules:
        - uid: ab735516-b49e-4ce8-bee8-b2ebbdd1c6f5
          title: Test annotation
          condition: B
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                alias: $tag_url
                datasource:
                    type: influxdb
                    uid: "000000001"
                intervalMs: 1000
                maxDataPoints: 43200
                query: SELECT mean("average_response_ms") FROM "ping" WHERE ("host" = 'openqa') AND $timeFilter GROUP BY time($__interval), "url" fill(null)
                rawQuery: true
                refId: A
                resultFormat: time_series
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 5
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - A
                      reducer:
                        params: []
                        type: max
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: ""
                intervalMs: 1000
                maxDataPoints: 43200
                refId: B
                type: classic_conditions
          noDataState: NoData
          execErrState: Error
          for: 10s
          annotations:
            summary: |-
                The following machines were not pingable for several minutes:
                {{ range $k, $v := $values -}}
                {{ if (match "B[0-9]+" $k) -}}
                * {{ $v.Labels }}{{ end }}
                {{ end }}

                Suggested actions:
                * foo
                * bar
          labels:
            __contacts__: Private message to nobody
          isPaused: true
        - uid: aed2400b-df4e-4374-b338-79f780436d68
          title: test for uid generation with much longer string length so to see if it abbreviates or hashes or something
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                intervalMs: 1000
                maxDataPoints: 43200
                refId: A
            - refId: B
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: B
                type: reduce
            - refId: C
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations:
            description: ""
            runbook_url: ""
            summary: ""
          labels:
            "": ""
          isPaused: false

We don't have a rule_uid in there, just "uid" with a seemingly generated identifier in UUID format. Maybe we can just deploy with that key removed.

Actions #10

Updated by okurz 30 days ago

  • Priority changed from High to Normal
Actions #11

Updated by okurz 11 days ago

  • Status changed from Workable to In Progress
Actions #12

Updated by openqa_review 10 days ago

  • Due date set to 2024-06-20

Setting due date based on mean cycle time of SUSE QE Tools
