action #127982
closed [alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker
Description
Updated by okurz over 1 year ago
- Description updated (diff)
- Priority changed from Urgent to High
I added a silence for disk_io_time alerts
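The silence itself is not shown here; for reference, creating one through Grafana's Alertmanager-compatible HTTP API would look roughly like the following. This is only a sketch: the matcher on alertname and the time window are assumptions, not a transcript of what was actually submitted (the silence may well have been added via the web UI).
curl -s -X POST https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" -H "Content-Type: application/json" \
  -d '{"matchers": [{"name": "alertname", "value": ".*Disk I/O time.*", "isRegex": true}],
       "startsAt": "2023-04-18T00:00:00Z", "endsAt": "2023-05-05T00:00:00Z",
       "createdBy": "okurz", "comment": "poo#127982: silence disk_io_time alerts"}'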
Updated by okurz over 1 year ago
- Copied to action #127985: [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M added
Updated by okurz over 1 year ago
- Subject changed from [alert][grafana] DatasourceNoData openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker to [alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker
- Status changed from New to Feedback
- Assignee set to okurz
Updated by okurz over 1 year ago
- Due date set to 2023-05-05
Updated by okurz over 1 year ago
- Status changed from Feedback to In Progress
We still have alerts from openqaw5, likely because that machine changed its role from worker to generic. We may need to delete something manually.
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
Over https://monitor.qa.suse.de/dashboards?query=openqaw5-xen I found two dashboards and deleted the "worker" one. On monitor.qa.suse.de I also deleted /etc/grafana/provisioning/alerting/dashboard-WDopenqaw5-xen.yaml and restarted grafana. There was still an alert in https://monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?viewPanel=12054 about memory usage showing "type=worker", but while I was looking at it, after about 1m the alert rule disappeared and so did the firing alert, leaving only the type=generic alert. A bit weird but ok for now. No more alert-related emails seen since then. Monitoring the situation.
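For the record, the manual cleanup on monitor.qa.suse.de described above boils down to the following (a sketch reconstructing the steps, not a copy of the exact shell commands that were run):
rm /etc/grafana/provisioning/alerting/dashboard-WDopenqaw5-xen.yaml
systemctl restart grafana-server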
Updated by okurz over 1 year ago
- Due date deleted (2023-05-05)
- Status changed from Feedback to Resolved
Unsilenced three alerts about "NoData". No further alerts observed.
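A rough sketch of how such silences can be looked up and removed again via Grafana's Alertmanager-compatible API (the unsilencing was probably done in the web UI; the endpoint usage and the jq filter are assumptions, not a transcript):
curl -s https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences -H "Authorization: Bearer $GRAFANA_API_TOKEN" | jq '.[] | {id, comment, status}'
curl -s -X DELETE "https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silence/<silence-id>" -H "Authorization: Bearer $GRAFANA_API_TOKEN"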
Updated by okurz over 1 year ago
- Due date set to 2023-05-05
- Status changed from Resolved to In Progress
There still seem to be alerts even though the old dashboard files have been deleted. I will look into what is stored in the database.
Updated by okurz over 1 year ago
I could not find anything related in /var/log/grafana/grafana.log
Upstream issue reports:
- https://community.grafana.com/t/alert-still-running-dashboard-deleted/6116
- https://github.com/grafana/grafana/issues/10037
- https://community.grafana.com/t/deleted-grafana-alert-still-firing/82870
There is a recommendation to "delete those alerts from the alert table in the database". On monitor.qa.suse.de I opened the database with
sqlite3 -readonly /var/lib/grafana/grafana.db
and ran
select * from alert;
which returned no results. However,
select * from alert_instance where current_state != 'Normal';
shows:
rule_org_id|rule_uid|labels|labels_hash|current_state|current_state_since|last_eval_time|current_state_end|current_reason
1|qa_network_infra_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","qa_network_infra_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: QA network infrastructure Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","qa_network_infra_ping_time_alert_openqaw5-xen"],["type","worker"]]|13e37a8ad3270e1caddf0e47dde609f16b338bf8|NoData|1682214240|1682415660|1682415840|
1|openqa_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","openqa_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: OpenQA Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","openqa_ping_time_alert_openqaw5-xen"],["type","worker"]]|51e00bfb83a79c51989dbcc488f874804085d08a|NoData|1682214240|1682415660|1682415840|
1|too_many_minion_job_failures_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","too_many_minion_job_failures_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Too many Minion job failures alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A,C"],["rule_uid","too_many_minion_job_failures_alert_openqaw5-xen"],["type","worker"]]|aa04534bffe2d7804ea14896b59000d7fc1fcc1d|NoData|1682214240|1682415660|1682415840|
1|download_rate_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","download_rate_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Download rate alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","download_count,download_rate"],["rule_uid","download_rate_alert_openqaw5-xen"],["type","worker"]]|821812b7c3a050d2fbc3aff9842ef17dbed25ce1|NoData|1682214240|1682415660|1682415840|
Trying to narrow it down with
select * from alert_instance where current_state != 'Normal' and rule_uid like '%openqaw5-xen' and labels like '%type","worker%';
shows:
rule_org_id|rule_uid|labels|labels_hash|current_state|current_state_since|last_eval_time|current_state_end|current_reason
1|qa_network_infra_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","qa_network_infra_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: QA network infrastructure Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","qa_network_infra_ping_time_alert_openqaw5-xen"],["type","worker"]]|13e37a8ad3270e1caddf0e47dde609f16b338bf8|NoData|1682214240|1682415960|1682416140|
1|openqa_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","openqa_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: OpenQA Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","openqa_ping_time_alert_openqaw5-xen"],["type","worker"]]|51e00bfb83a79c51989dbcc488f874804085d08a|NoData|1682214240|1682415960|1682416140|
1|too_many_minion_job_failures_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","too_many_minion_job_failures_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Too many Minion job failures alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A,C"],["rule_uid","too_many_minion_job_failures_alert_openqaw5-xen"],["type","worker"]]|aa04534bffe2d7804ea14896b59000d7fc1fcc1d|NoData|1682214240|1682415960|1682416140|
1|download_rate_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","download_rate_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Download rate alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","download_count,download_rate"],["rule_uid","download_rate_alert_openqaw5-xen"],["type","worker"]]|821812b7c3a050d2fbc3aff9842ef17dbed25ce1|NoData|1682214240|1682415960|1682416140|
I ensured that the backup of the grafana db is up-to-date:
$ ssh backup.qa.suse.de "ls -ltra /home/rsnapshot/alpha.0/openqa-monitor.qa.suse.de/var/lib/grafana/grafana.db"
-rw-r--r-- 1 brltty sgx 143151104 Apr 25 11:50 /home/rsnapshot/alpha.0/openqa-monitor.qa.suse.de/var/lib/grafana/grafana.db
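Should the manual database surgery below go wrong, restoring from that snapshot would look roughly like this (only a sketch, assuming grafana is stopped during the copy and the usual grafana:grafana ownership; this was not actually needed):
systemctl stop grafana-server
scp backup.qa.suse.de:/home/rsnapshot/alpha.0/openqa-monitor.qa.suse.de/var/lib/grafana/grafana.db /var/lib/grafana/grafana.db
chown grafana:grafana /var/lib/grafana/grafana.db  # ownership is an assumption about the package defaults
systemctl start grafana-server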
Then mkittler told me we can follow https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts
I prepared a SQL delete statement to execute while grafana is down, followed by a restart of grafana:
systemctl stop grafana-server && sqlite3 /var/lib/grafana/grafana.db 'delete from alert_rule where labels like "%openqaw5-xen%" and labels like "%type%worker%"; delete from alert_instance where current_state != "Normal" and rule_uid like "%openqaw5-xen" and labels like "%type%worker%"; delete from alert_rule_version where rule_uid like "%openqaw5-xen" and labels like "%type%worker%"; delete from provenance_type where record_key like "%openqaw5-xen";' && sync && systemctl start grafana-server
Executed that; looks ok so far.
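A quick follow-up check that the stale instances are really gone, reusing the query from above (not recorded in the ticket, just the obvious verification step):
sqlite3 -readonly /var/lib/grafana/grafana.db "select count(*) from alert_instance where rule_uid like '%openqaw5-xen' and labels like '%type%worker%';"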
Updated by okurz over 1 year ago
- Status changed from In Progress to Feedback
Will wait and see if anything shows up again.
Updated by okurz over 1 year ago
- Due date deleted (2023-05-05)
- Status changed from Feedback to Resolved
No further problems observed, confirmed by nsinger and mkittler