action #160284: grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #160284

closed

grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M

Added by okurz 7 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-05-13
Due date:
% Done:
Estimated time:
Tags: alert, grafana, infra

Description

Observation

https://monitor.qa.suse.de/ yields

502 Bad Gateway

From

journalctl -u grafana-server

May 13 12:04:54 monitor grafana[28845]: cannot create rule with UID 'qa_network_infra_ping_time_alert_s390zl12': UID is longer than 40 symbols
…
May 13 12:05:31 monitor grafana[29160]: cannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols

The alerts are defined in monitor:/etc/grafana/provisioning/alerting/dashboard-WDs390zl12.yaml. I temporarily changed that string locally from "too_many_minion_job_failures_alert_s390zl12" to "too_many_minion_job_failures_s390zl12", and shortened the other one accordingly. So apparently only those two strings are problematic?
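
To make the limit from the log message concrete, here is a small sketch that measures the UIDs quoted above. The helper name uid_length_report is made up for illustration; the limit of 40 is taken from the error message. The two original UIDs come out at 41 and 43 characters, the shortened one at 37.

```python
# Limit as quoted in the Grafana log message above.
MAX_UID_LENGTH = 40


def uid_length_report(uids, limit=MAX_UID_LENGTH):
    """Map each UID to a (length, fits_within_limit) tuple."""
    return {uid: (len(uid), len(uid) <= limit) for uid in uids}


report = uid_length_report([
    "qa_network_infra_ping_time_alert_s390zl12",
    "too_many_minion_job_failures_alert_s390zl12",
    "too_many_minion_job_failures_s390zl12",
])
for uid, (length, ok) in report.items():
    print(f"{length:3d} {'ok' if ok else 'too long'}  {uid}")
```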

Acceptance criteria

AC1: grafana starts up consistently again
AC2: static code checks prevent us from running into the same problem before merging MRs

Suggestions

DONE Fix the problem transiently
DONE Research upstream for the problem. Maybe a new automatic grafana version upgrade triggered this? -> The feature change happened in https://github.com/grafana/grafana/commit/99fd7b8141e9cec296b810760ec0e86136ebfca0 in 2023-09, so some time afterwards we got the new version including this, but we haven't added problematically long alerts since then.
Understand why only the two strings mentioned in the observation pose a problem
Fix the problem in salt-states-openqa for all UIDs
Add a check for UID length that is called in CI
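
A possible shape for the UID length check from the last suggestion, as a rough sketch only. It assumes the check runs on rendered provisioning YAML (not on the Jinja templates, where the final UID is not yet known) and that UIDs appear on lines of the form "uid: value". The names find_long_uids and main are hypothetical.

```python
import re

MAX_UID_LENGTH = 40  # limit quoted in Grafana's error message
# Matches "uid: value" with optional indentation and list dash.
UID_LINE = re.compile(r"^\s*(?:-\s+)?uid:\s*(\S+)")


def find_long_uids(text, limit=MAX_UID_LENGTH):
    """Return (line_number, uid) pairs for every uid longer than limit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = UID_LINE.match(line)
        if match:
            uid = match.group(1).strip("\"'")
            if len(uid) > limit:
                hits.append((number, uid))
    return hits


def main(paths):
    """Print offending UIDs; return 1 if any were found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, uid in find_long_uids(handle.read()):
                print(f"{path}:{number}: uid '{uid}' has {len(uid)} characters")
                failed = True
    return 1 if failed else 0
```

A CI job could pass the rendered files to main() and use the return value as exit code, so that a merge request introducing a UID longer than 40 characters fails before deployment.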

Updated by okurz 7 months ago

Description updated (diff)
Status changed from New to In Progress
Assignee set to okurz

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1183 for the mitigation

Updated by tinita 7 months ago

What's weird is that we have more existing uids with more than 40 characters, but grafana doesn't complain about them:

% grep -r " uid:" /etc/grafana/ | perl -nlwE'if (m/uid: (\S{41})/) { say $_ }'
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: qa_network_infra_ping_time_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml:    uid: too_many_minion_job_failures_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: qa_network_infra_ping_time_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml:    uid: too_many_minion_job_failures_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: qa_network_infra_ping_time_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml:    uid: too_many_minion_job_failures_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: qa_network_infra_ping_time_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml:    uid: too_many_minion_job_failures_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: openqa_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml:    uid: partitions_usage_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml:    uid: too_many_minion_job_failures_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: qa_network_infra_ping_time_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml:    uid: too_many_minion_job_failures_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml:    uid: too_many_minion_job_failures_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml:    uid: too_many_minion_job_failures_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: qa_network_infra_ping_time_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml:    uid: too_many_minion_job_failures_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: qa_network_infra_ping_time_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml:    uid: too_many_minion_job_failures_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml:    uid: too_many_minion_job_failures_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: qa_network_infra_ping_time_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml:    uid: too_many_minion_job_failures_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDpetrol.yaml:    uid: too_many_minion_job_failures_alert_petrol
/etc/grafana/provisioning/alerting/dashboard-WDdiesel.yaml:    uid: too_many_minion_job_failures_alert_diesel
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: qa_network_infra_ping_time_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml:    uid: too_many_minion_job_failures_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: ssl_expiration_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml:    uid: san_validity_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml:    uid: too_many_minion_job_failures_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: qa_network_infra_ping_time_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml:    uid: too_many_minion_job_failures_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: qa_network_infra_ping_time_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml:    uid: too_many_minion_job_failures_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: qa_network_infra_ping_time_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml:    uid: too_many_minion_job_failures_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: qa_network_infra_ping_time_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml:    uid: too_many_minion_job_failures_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: qa_network_infra_ping_time_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml:    uid: too_many_minion_job_failures_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: qa_network_infra_ping_time_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml:    uid: too_many_minion_job_failures_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: qa_network_infra_ping_time_alert_worker29
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml:    uid: too_many_minion_job_failures_alert_worker29

Updated by okurz 7 months ago

Description updated (diff)
Status changed from In Progress to New
Assignee deleted (~~okurz~~)
Priority changed from Urgent to High

The problem has only been triggered on s390zl12 because nicksinger added the role "worker" to the host s390zl12 for #159066, causing grafana to try to recreate those alerts. All other alerts still exist in the database, and grafana only complains about new alerts. By now I have also reverted the addition of the worker role to s390zl12, so two levels of workaround are now applied. We still need to continue with the other steps to prevent problems with new alerts in the future.

Updated by okurz 7 months ago

Assignee set to nicksinger

As decided with nicksinger in a Jitsi talk.

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1185 to further reduce the length, which was necessary e.g. for openqaworker-arm-1, a rather long hostname.

Updated by nicksinger 7 months ago

Removing the old alerts with the old UID from the DB was necessary to avoid a duplication because the title of the alert is the same. We used the following manually on monitor:

monitor:/etc/grafana/provisioning/alerting # RULE_UID=qa_network_infra_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
monitor:/etc/grafana/provisioning/alerting # RULE_UID=too_many_minion_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
>   delete from alert_rule where uid like '${RULE_UID}%';
>   delete from alert_rule_version where rule_uid like '${RULE_UID}%';
>   delete from alert_instance where rule_uid like '${RULE_UID}%';
>   delete from provenance_type where record_key like '${RULE_UID}%';
>   delete from annotation where text like '%${RULE_UID}%';
> "
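
For reference, the same cleanup could be scripted roughly like this. This is only a sketch: the table and column names are the ones from the manual commands above, the function names are made up, and nothing else about the Grafana schema is assumed. One deliberate difference from the manual commands: in SQL LIKE patterns "_" matches any single character, so the sketch escapes the prefix to match it literally.

```python
import sqlite3

# (table, column, match anywhere in the value?) as in the manual cleanup above.
TARGETS = [
    ("alert_rule", "uid", False),
    ("alert_rule_version", "rule_uid", False),
    ("alert_instance", "rule_uid", False),
    ("provenance_type", "record_key", False),
    ("annotation", "text", True),
]


def escape_like(value, escape="\\"):
    """Escape LIKE wildcards so that '_' and '%' match literally."""
    for char in (escape, "%", "_"):
        value = value.replace(char, escape + char)
    return value


def delete_rules_by_prefix(conn, prefix):
    """Delete all rows referring to rule UIDs that start with prefix."""
    literal = escape_like(prefix)
    deleted = {}
    with conn:  # one transaction: commit on success, roll back on error
        for table, column, anywhere in TARGETS:
            pattern = ("%" if anywhere else "") + literal + "%"
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE {column} LIKE ? ESCAPE '\\'",
                (pattern,),
            )
            deleted[table] = cursor.rowcount
    return deleted
```

It would be called with a connection to a copy of grafana.db first, e.g. delete_rules_by_prefix(sqlite3.connect(path), "qa_network_infra_alert_"), and returns the number of deleted rows per table.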

Updated by livdywan 7 months ago

Subject changed from grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" to grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M
Status changed from New to Workable

We discussed the ticket, and estimated it without changing the description

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1194

Updated by okurz 7 months ago

We tried it out with a temporary alert, and the export looks like this:

apiVersion: 1
groups:
    - orgId: 1
      name: test-quick
      folder: WIP
      interval: 10s
      rules:
        - uid: ab735516-b49e-4ce8-bee8-b2ebbdd1c6f5
          title: Test annotation
          condition: B
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                alias: $tag_url
                datasource:
                    type: influxdb
                    uid: "000000001"
                intervalMs: 1000
                maxDataPoints: 43200
                query: SELECT mean("average_response_ms") FROM "ping" WHERE ("host" = 'openqa') AND $timeFilter GROUP BY time($__interval), "url" fill(null)
                rawQuery: true
                refId: A
                resultFormat: time_series
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 5
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - A
                      reducer:
                        params: []
                        type: max
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: ""
                intervalMs: 1000
                maxDataPoints: 43200
                refId: B
                type: classic_conditions
          noDataState: NoData
          execErrState: Error
          for: 10s
          annotations:
            summary: |-
                The following machines were not pingable for several minutes:
                {{ range $k, $v := $values -}}
                {{ if (match "B[0-9]+" $k) -}}
                * {{ $v.Labels }}{{ end }}
                {{ end }}

                Suggested actions:
                * foo
                * bar
          labels:
            __contacts__: Private message to nobody
          isPaused: true
        - uid: aed2400b-df4e-4374-b338-79f780436d68
          title: test for uid generation with much longer string length so to see if it abbreviates or hashes or something
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: "000000001"
              model:
                intervalMs: 1000
                maxDataPoints: 43200
                refId: A
            - refId: B
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: B
                type: reduce
            - refId: C
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations:
            description: ""
            runbook_url: ""
            summary: ""
          labels:
            "": ""
          isPaused: false

We don't have a rule_uid in there, just "uid" with a seemingly generated hash. Maybe we can just deploy with that key removed.

#10

Updated by okurz 7 months ago

Priority changed from High to Normal

#11

Updated by okurz 7 months ago

Status changed from Workable to In Progress

#12

Updated by openqa_review 7 months ago

Due date set to 2024-06-20

Setting due date based on mean cycle time of SUSE QE Tools

#13

Updated by nicksinger 6 months ago

Status changed from In Progress to Feedback

Checked our files again and realized only the generic- and the worker-dashboard contain manual UIDs written by us. I used :%s/\(\s\+\)uid:\s\+\(.*\){{\s*\(.*\)\s*}}$/\1uid: {{ (('\2' + \3) | sha512[:40] }}/g in vim to generate: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211
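
The idea behind the hashed UIDs, sketched outside of Jinja: concatenate the readable prefix and the host name, hash the result, and keep the first 40 characters. This assumes the sha512 filter in the template yields a hex digest; the function name hashed_uid is only illustrative and not what the merge request uses.

```python
import hashlib

MAX_UID_LENGTH = 40  # limit quoted in Grafana's error message


def hashed_uid(alert_name, host, length=MAX_UID_LENGTH):
    """Derive a fixed-length UID from an alert name and a host name."""
    digest = hashlib.sha512((alert_name + host).encode("utf-8")).hexdigest()
    return digest[:length]


uid = hashed_uid("too_many_minion_job_failures_alert_", "openqaworker-arm-1")
print(len(uid), uid)
```

The UID no longer depends on the hostname length and stays stable across deployments, at the cost of no longer being human-readable.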

#14

Updated by okurz 6 months ago

Due date changed from 2024-06-20 to 2024-07-04
Status changed from Feedback to Workable

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211 has a mention about deleting older UIDs so back to "Workable"

#15

Updated by livdywan 6 months ago

Due date deleted (~~2024-07-04~~)

#16

Updated by nicksinger 5 months ago

Status changed from Workable to Feedback

Creating deletion rules according to doc ( https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning/#import-alert-rules ) with:

(echo 'apiVersion: 1'; echo 'deleteRules:'; ssh root@monitor.qe.nue2.suse.org '(grep -ri " uid:" /etc/grafana/provisioning/alerting/dashboard-GD* | cut -d ":" -f 3-; grep -ri "^\s\{4\}uid:" /etc/grafana/provisioning/alerting/dashboard-WD* | cut -d: -f 3-)' | xargs -I{} bash -c "echo -e '  - orgId: 1';echo '    uid: {}'") | yq

and added them to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211
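
The generation step of the one-liner above could also be written as a small script, roughly like this. It is a sketch: render_delete_rules is a hypothetical helper, and the list of UIDs is assumed to have been collected already (e.g. with the grep calls above). The output follows the apiVersion/deleteRules/orgId/uid layout used above.

```python
import json


def render_delete_rules(uids, org_id=1):
    """Render a provisioning document that asks Grafana to delete rules."""
    lines = ["apiVersion: 1", "deleteRules:"]
    for uid in sorted(set(uids)):
        lines.append(f"  - orgId: {org_id}")
        # json.dumps yields a double-quoted string, which is also valid YAML
        # for plain ASCII UIDs like the ones in this ticket.
        lines.append(f"    uid: {json.dumps(uid)}")
    return "\n".join(lines) + "\n"


print(render_delete_rules([
    "too_many_minion_job_failures_alert_worker40",
    "qa_network_infra_ping_time_alert_worker40",
]))
```

Quoting every UID avoids surprises with values that YAML would otherwise read as numbers, and deduplicating keeps the file small when the same UID is found twice.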

#17

Updated by nicksinger 5 months ago

Unfortunately I noticed that the deployment (silently) failed. First I had to fix the salt-minion on monitor, which was not properly responding to master requests, then I had to follow up on a syntax error introduced by me: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1223

Deploying the new worker and generic alerts then worked, but grafana took a very long time to restart, resulting in a systemd timeout and an endless restart loop (I guess it could never complete the whole deletion and recreation routine in time). Manually starting grafana-server as user grafana then allowed me to progress further, revealing duplicate entries but no further details in the log.
Moving out all "WD" and "GD" files made grafana start up, so the cleanup of https://progress.opensuse.org/issues/160284#note-16 was not sufficient. In that state I was able to export a list of all currently deployed alerts from the grafana web UI. I then generated a list of the currently disabled alerts on monitor with cat *WD* | grep "^ name:".

Combining these lists gave me a list of left-over, old alerts which conflict with the new alerts containing the hash as UID:

while read p; do cat alert-rules-1720458432594.yaml | yq -r '(.groups[] | select(.name == "'"$p"'")).rules[].uid'; done < <(cat should_be_gone.txt | cut -d \' -f 2)
system_load_alert_grenache-1
system_load_alert_imagetester
system_load_alert_mania
system_load_alert_openqaworker14
system_load_alert_openqaworker16
system_load_alert_openqaworker17
system_load_alert_openqaworker18
system_load_alert_openqaworker1
system_load_alert_openqaworker-arm-1
system_load_alert_petrol
system_load_alert_qesapworker-prg4
system_load_alert_qesapworker-prg5
system_load_alert_qesapworker-prg6
system_load_alert_qesapworker-prg7
system_load_alert_sapworker1
system_load_alert_sapworker2
system_load_alert_sapworker3
system_load_alert_worker29
system_load_alert_worker30
system_load_alert_worker31
system_load_alert_worker32
system_load_alert_worker33
system_load_alert_worker34
system_load_alert_worker35
system_load_alert_worker40
system_load_alert_worker-arm1
system_load_alert_worker-arm2

After deleting these alerts and moving back the new alert definitions, I restarted grafana once more and also salt-minion to apply a final highstate again with the migration partly done manually. This succeeded.
Final cleanup is done with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1224
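
The lookup of left-over UIDs described above boils down to: take the exported rule groups, keep the groups whose name is in the list of disabled alerts, and collect the UIDs of their rules. A sketch of that selection on an already parsed export follows; uids_in_groups and the sample group names are made up for illustration, only the groups/name/rules/uid layout is taken from the export shown earlier.

```python
def uids_in_groups(export, group_names):
    """Collect rule UIDs from all exported groups whose name is listed."""
    wanted = set(group_names)
    return [
        rule["uid"]
        for group in export.get("groups", [])
        if group.get("name") in wanted
        for rule in group.get("rules", [])
    ]


# Hypothetical, heavily shortened export with made-up group names.
export = {
    "groups": [
        {"name": "group-a", "rules": [{"uid": "system_load_alert_worker29"}]},
        {"name": "group-b", "rules": [{"uid": "system_load_alert_worker30"}]},
        {"name": "group-c", "rules": [{"uid": "some_other_alert"}]},
    ]
}
print(uids_in_groups(export, ["group-a", "group-b"]))
```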

#18

Updated by nicksinger 5 months ago

Status changed from Feedback to Resolved
