action #160284
grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M
Description
Observation
https://monitor.qa.suse.de/ yields
502 Bad Gateway
From journalctl -u grafana-server:
May 13 12:04:54 monitor grafana[28845]: cannot create rule with UID 'qa_network_infra_ping_time_alert_s390zl12': UID is longer than 40 symbols
…
May 13 12:05:31 monitor grafana[29160]: cannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols
The alerts are defined in monitor:/etc/grafana/provisioning/alerting/dashboard-WDs390zl12.yaml . I temporarily changed that string locally from "too_many_minion_job_failures_alert_s390zl12" to "too_many_minion_job_failures_s390zl12", and accordingly for the other one. So apparently only those two strings are problematic?
Acceptance criteria
- AC1: grafana starts up consistently again
- AC2: static code checks prevent us from running into the same problem before merging MRs
Suggestions
- DONE Fix the problem transiently
- DONE Research upstream for the problem. Maybe a new automatic grafana version upgrade triggered this? -> The feature change happened in https://github.com/grafana/grafana/commit/99fd7b8141e9cec296b810760ec0e86136ebfca0 (2023-09), so some time afterwards we got a new version including it, but we haven't added problematically long alerts since then.
- Understand why only the two strings mentioned in the observation pose a problem
- Fix the problem in salt-states-openqa for all UIDs
- Add a CI check for UID length (see the sketch below)
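A minimal sketch of such a check (hypothetical script name and repository path; templated UIDs in the repo would additionally need the longest possible hostname substituted before measuring):
#!/bin/bash
# check-uid-length.sh (hypothetical): fail if any provisioned alert rule
# UID exceeds grafana's 40-symbol limit.
set -euo pipefail
rc=0
while read -r uid; do
    if [ "${#uid}" -gt 40 ]; then
        echo "UID longer than 40 symbols: $uid" >&2
        rc=1
    fi
done < <(grep -rhoP '^\s*uid:\s*\K\S+' monitoring/grafana/ || true)
exit "$rc"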
Updated by tinita 7 months ago
What's weird is that we have more existing uids with 41 or more characters, but grafana doesn't complain anymore:
% grep -r " uid:" /etc/grafana/ | perl -nlwE'if (m/uid: (\S{41})/) { say $_ }'
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml: uid: qa_network_infra_ping_time_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker40.yaml: uid: too_many_minion_job_failures_alert_worker40
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml: uid: qa_network_infra_ping_time_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDworker33.yaml: uid: too_many_minion_job_failures_alert_worker33
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml: uid: qa_network_infra_ping_time_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker1.yaml: uid: too_many_minion_job_failures_alert_sapworker1
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml: uid: qa_network_infra_ping_time_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDworker31.yaml: uid: too_many_minion_job_failures_alert_worker31
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml: uid: openqa_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml: uid: qa_network_infra_ping_time_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml: uid: too_many_minion_job_failures_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker-arm-1.yaml: uid: partitions_usage_alert_openqaworker-arm-1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml: uid: qa_network_infra_ping_time_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm2.yaml: uid: too_many_minion_job_failures_alert_worker-arm2
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml: uid: qa_network_infra_ping_time_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg6.yaml: uid: too_many_minion_job_failures_alert_qesapworker-prg6
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml: uid: qa_network_infra_ping_time_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDsapworker2.yaml: uid: too_many_minion_job_failures_alert_sapworker2
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml: uid: qa_network_infra_ping_time_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker17.yaml: uid: too_many_minion_job_failures_alert_openqaworker17
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml: uid: qa_network_infra_ping_time_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker18.yaml: uid: too_many_minion_job_failures_alert_openqaworker18
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml: uid: qa_network_infra_ping_time_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDimagetester.yaml: uid: too_many_minion_job_failures_alert_imagetester
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml: uid: qa_network_infra_ping_time_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDgrenache-1.yaml: uid: too_many_minion_job_failures_alert_grenache-1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml: uid: qa_network_infra_ping_time_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker14.yaml: uid: too_many_minion_job_failures_alert_openqaworker14
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml: uid: qa_network_infra_ping_time_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDworker35.yaml: uid: too_many_minion_job_failures_alert_worker35
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml: uid: qa_network_infra_ping_time_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg5.yaml: uid: too_many_minion_job_failures_alert_qesapworker-prg5
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml: uid: qa_network_infra_ping_time_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg7.yaml: uid: too_many_minion_job_failures_alert_qesapworker-prg7
/etc/grafana/provisioning/alerting/dashboard-WDpetrol.yaml: uid: too_many_minion_job_failures_alert_petrol
/etc/grafana/provisioning/alerting/dashboard-WDdiesel.yaml: uid: too_many_minion_job_failures_alert_diesel
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml: uid: qa_network_infra_ping_time_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-WDworker32.yaml: uid: too_many_minion_job_failures_alert_worker32
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml: uid: ssl_expiration_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml: uid: san_validity_alert_openqa.oqa.prg2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml: uid: ssl_expiration_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-certificates.yaml: uid: san_validity_alert_monitor.qe.nue2.suse.org
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml: uid: qa_network_infra_ping_time_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker16.yaml: uid: too_many_minion_job_failures_alert_openqaworker16
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml: uid: qa_network_infra_ping_time_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDopenqaworker1.yaml: uid: too_many_minion_job_failures_alert_openqaworker1
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml: uid: qa_network_infra_ping_time_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDqesapworker-prg4.yaml: uid: too_many_minion_job_failures_alert_qesapworker-prg4
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml: uid: qa_network_infra_ping_time_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker34.yaml: uid: too_many_minion_job_failures_alert_worker34
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml: uid: qa_network_infra_ping_time_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker30.yaml: uid: too_many_minion_job_failures_alert_worker30
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml: uid: qa_network_infra_ping_time_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDworker-arm1.yaml: uid: too_many_minion_job_failures_alert_worker-arm1
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml: uid: qa_network_infra_ping_time_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDsapworker3.yaml: uid: too_many_minion_job_failures_alert_sapworker3
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml: uid: qa_network_infra_ping_time_alert_worker29
/etc/grafana/provisioning/alerting/dashboard-WDworker29.yaml: uid: too_many_minion_job_failures_alert_worker29
Updated by okurz 7 months ago
- Description updated (diff)
- Status changed from In Progress to New
- Assignee deleted (okurz)
- Priority changed from Urgent to High
The problem was only triggered on s390zl12 because nicksinger added the role "worker" to the host s390zl12 for #159066, causing grafana to try to recreate those alerts; all the others still exist in the database, and grafana only complains about new alerts. By now I have also reverted the addition of the worker role to s390zl12, so two levels of workaround are now applied. We still need to continue with the other steps to prevent problems with new alerts in the future.
Updated by okurz 7 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1185 further reduces the length where necessary, e.g. for openqaworker-arm-1, a rather long hostname.
Updated by nicksinger 7 months ago
Removing the old alerts with the old UIDs from the DB was necessary to avoid duplication because the alert titles stay the same. We ran the following manually on monitor:
monitor:/etc/grafana/provisioning/alerting # RULE_UID=qa_network_infra_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
> delete from alert_rule where uid like '${RULE_UID}%';
> delete from alert_rule_version where rule_uid like '${RULE_UID}%';
> delete from alert_instance where rule_uid like '${RULE_UID}%';
> delete from provenance_type where record_key like '${RULE_UID}%';
> delete from annotation where text like '%${RULE_UID}%';
> "
monitor:/etc/grafana/provisioning/alerting # RULE_UID=too_many_minion_alert_; sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
> delete from alert_rule where uid like '${RULE_UID}%';
> delete from alert_rule_version where rule_uid like '${RULE_UID}%';
> delete from alert_instance where rule_uid like '${RULE_UID}%';
> delete from provenance_type where record_key like '${RULE_UID}%';
> delete from annotation where text like '%${RULE_UID}%';
> "
Updated by livdywan 7 months ago
- Subject changed from grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" to grafana server fails to start due to "alert rules: invalid alert rule\ncannot create rule with UID 'too_many_minion_job_failures_alert_s390zl12': UID is longer than 40 symbols" size:M
- Status changed from New to Workable
We discussed the ticket, and estimated it without changing the description
Updated by okurz 7 months ago
We tried this out with a temporary alert, and the export looks like this:
apiVersion: 1
groups:
- orgId: 1
name: test-quick
folder: WIP
interval: 10s
rules:
- uid: ab735516-b49e-4ce8-bee8-b2ebbdd1c6f5
title: Test annotation
condition: B
data:
- refId: A
relativeTimeRange:
from: 600
to: 0
datasourceUid: "000000001"
model:
alias: $tag_url
datasource:
type: influxdb
uid: "000000001"
intervalMs: 1000
maxDataPoints: 43200
query: SELECT mean("average_response_ms") FROM "ping" WHERE ("host" = 'openqa') AND $timeFilter GROUP BY time($__interval), "url" fill(null)
rawQuery: true
refId: A
resultFormat: time_series
- refId: B
relativeTimeRange:
from: 600
to: 0
datasourceUid: __expr__
model:
conditions:
- evaluator:
params:
- 5
- 0
type: gt
operator:
type: and
query:
params:
- A
reducer:
params: []
type: max
type: query
datasource:
name: Expression
type: __expr__
uid: __expr__
expression: ""
intervalMs: 1000
maxDataPoints: 43200
refId: B
type: classic_conditions
noDataState: NoData
execErrState: Error
for: 10s
annotations:
summary: |-
The following machines were not pingable for several minutes:
{{ range $k, $v := $values -}}
{{ if (match "B[0-9]+" $k) -}}
* {{ $v.Labels }}{{ end }}
{{ end }}
Suggested actions:
* foo
* bar
labels:
__contacts__: Private message to nobody
isPaused: true
- uid: aed2400b-df4e-4374-b338-79f780436d68
title: test for uid generation with much longer string length so to see if it abbreviates or hashes or something
condition: C
data:
- refId: A
relativeTimeRange:
from: 600
to: 0
datasourceUid: "000000001"
model:
intervalMs: 1000
maxDataPoints: 43200
refId: A
- refId: B
datasourceUid: __expr__
model:
conditions:
- evaluator:
params: []
type: gt
operator:
type: and
query:
params:
- B
reducer:
params: []
type: last
type: query
datasource:
type: __expr__
uid: __expr__
expression: A
intervalMs: 1000
maxDataPoints: 43200
reducer: last
refId: B
type: reduce
- refId: C
datasourceUid: __expr__
model:
conditions:
- evaluator:
params:
- 0
type: gt
operator:
type: and
query:
params:
- C
reducer:
params: []
type: last
type: query
datasource:
type: __expr__
uid: __expr__
expression: B
intervalMs: 1000
maxDataPoints: 43200
refId: C
type: threshold
noDataState: NoData
execErrState: Error
for: 5m
annotations:
description: ""
runbook_url: ""
summary: ""
labels:
"": ""
isPaused: false
We don't have a rule_uid in there, just "uid" with a seemingly auto-generated value. Maybe we can simply deploy with that key removed.
Updated by openqa_review 7 months ago
- Due date set to 2024-06-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 6 months ago
- Status changed from In Progress to Feedback
Checked our files again and realized that only the generic and the worker dashboards contain manual UIDs written by us. I used the following substitution in vim to generate https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211:
:%s/\(\s\+\)uid:\s\+\(.*\){{\s*\(.*\)\s*}}$/\1uid: {{ (('\2' + \3) | sha512)[:40] }}/g
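For illustration (a sketch; "host" stands in for whatever variable the template actually uses, and Salt's sha512 Jinja filter is assumed), a line like
    uid: too_many_minion_job_failures_alert_{{ host }}
becomes
    uid: {{ (('too_many_minion_job_failures_alert_' + host) | sha512)[:40] }}
i.e. the first 40 hex digits of the digest, which always fits within grafana's limit and stays stable per prefix+host.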
Updated by okurz 6 months ago
- Due date changed from 2024-06-20 to 2024-07-04
- Status changed from Feedback to Workable
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211 mentions deleting older UIDs, so back to "Workable"
Updated by nicksinger 5 months ago
- Status changed from Workable to Feedback
Creating deletion rules according to the docs ( https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning/#import-alert-rules ) with:
(echo 'apiVersion: 1'; echo 'deleteRules:'; ssh root@monitor.qe.nue2.suse.org '(grep -ri " uid:" /etc/grafana/provisioning/alerting/dashboard-GD* | cut -d ":" -f 3-; grep -ri "^\s\{4\}uid:" /etc/grafana/provisioning/alerting/dashboard-WD* | cut -d: -f 3-)' | xargs -I{} bash -c "echo -e ' - orgId: 1';echo ' uid: {}'") | yq
and added them to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1211
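For reference, the generated entries follow the documented deleteRules format, e.g. with one of the long UIDs from the listing above:
apiVersion: 1
deleteRules:
  - orgId: 1
    uid: qa_network_infra_ping_time_alert_worker40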
Updated by nicksinger 5 months ago
Unfortunately I noticed that the deployment (silently) failed. First I had to fix the salt-minion on monitor, which was not properly responding to master requests, then had to follow up on a syntax error introduced by me: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1223
Deploying the new worker and generic alerts worked then, but grafana took a very long time to restart, resulting in a systemd timeout and an endless restart loop (I guess it could never complete the whole deletion and recreation routine in time). Manually starting grafana-server as user grafana allowed me to progress further, revealing duplicate entries but no further details in the log.
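One possible mitigation for such long deletion/recreation runs (a sketch, assuming the stock grafana-server unit; the value is an arbitrary example) is a systemd drop-in raising the start timeout:
# /etc/systemd/system/grafana-server.service.d/timeout.conf
[Service]
TimeoutStartSec=15min
followed by systemctl daemon-reload && systemctl restart grafana-server.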
Moving out all "WD" and "GD"-files made grafana start up so the cleanup of https://progress.opensuse.org/issues/160284#note-16 was not sufficient. In that state I was able to export a list from the grafana webui of all currently deployed alerts. I then generated a list of the currently disabled alerts on monitor with cat *WD* | grep "^ name:"
.
Combining these lists gave me a list of left-over, old alerts which conflict with the new alerts containing the hash as UID:
while read p; do cat alert-rules-1720458432594.yaml | yq -r '(.groups[] | select(.name == "'"$p"'")).rules[].uid'; done < <(cat should_be_gone.txt | cut -d \' -f 2)
system_load_alert_grenache-1
system_load_alert_imagetester
system_load_alert_mania
system_load_alert_openqaworker14
system_load_alert_openqaworker16
system_load_alert_openqaworker17
system_load_alert_openqaworker18
system_load_alert_openqaworker1
system_load_alert_openqaworker-arm-1
system_load_alert_petrol
system_load_alert_qesapworker-prg4
system_load_alert_qesapworker-prg5
system_load_alert_qesapworker-prg6
system_load_alert_qesapworker-prg7
system_load_alert_sapworker1
system_load_alert_sapworker2
system_load_alert_sapworker3
system_load_alert_worker29
system_load_alert_worker30
system_load_alert_worker31
system_load_alert_worker32
system_load_alert_worker33
system_load_alert_worker34
system_load_alert_worker35
system_load_alert_worker40
system_load_alert_worker-arm1
system_load_alert_worker-arm2
After deleting these alerts and moving the new alert definitions back, I restarted grafana once more, and also salt-minion to apply a final highstate, with the migration partly done manually. This succeeded.
Final cleanup is done with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1224