action #127982

closed

[alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: -
Target version:
Start date: 2023-04-19
Due date:
% Done: 0%
Estimated time:
Tags:

Related issues: 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #127985: [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M (Resolved, assignee: mkittler, start date 2023-04-19, due date 2023-05-04)

Actions #1

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Priority changed from Urgent to High

I added a silence for the disk_io_time alerts.
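
For reference, such a silence can be added in the Grafana UI or via Grafana's built-in Alertmanager API; a rough sketch of the API variant (token, matcher regex and time range here are placeholders, not necessarily what was actually used):

curl -s -X POST https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" -H "Content-Type: application/json" \
  -d '{"matchers": [{"name": "alertname", "value": ".*Disk I/O time alert.*", "isRegex": true}],
      "startsAt": "2023-04-19T09:00:00Z", "endsAt": "2023-05-05T09:00:00Z",
      "createdBy": "okurz", "comment": "poo#127982 silence disk_io_time alerts"}'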

Actions #2

Updated by okurz over 1 year ago

  • Copied to action #127985: [alert][grafana] Handle various problems with slow monitoring https requests and such (was: DatasourceNoData HTTP Response alert) size:M added
Actions #3

Updated by okurz over 1 year ago

  • Subject changed from [alert][grafana] DatasourceNoData openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker to [alert][grafana] DatasourceNoData e.g. openQA QA-Power8-5-kvm+openqaw5-xen Disk I/O time alert worker
  • Status changed from New to Feedback
  • Assignee set to okurz
Actions #6

Updated by okurz over 1 year ago

  • Status changed from Feedback to In Progress

We still have alerts from openqaw5, likely because that machine changed its role from worker to generic. Maybe something needs to be deleted manually.

Actions #7

Updated by okurz over 1 year ago

  • Status changed from In Progress to Feedback

Over https://monitor.qa.suse.de/dashboards?query=openqaw5-xen I found two dashboards and deleted the "worker" one. Also on monitor.qa.suse.de I deleted /etc/grafana/provisioning/alerting/dashboard-WDopenqaw5-xen.yaml and restarted grafana. There was still an alert in https://monitor.qa.suse.de/d/GDopenqaw5-xen/dashboard-for-openqaw5-xen?viewPanel=12054 about memory usage and it showed "type=worker", but while looking at it, after about 1m, the alert rule disappeared and so did the firing alert, leaving only the type=generic alert. A bit weird but ok for now. No more alert-related emails seen since then. Monitoring the situation.
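
Roughly, the file-based part of that cleanup on monitor.qa.suse.de comes down to the following (reconstructed from the description above; the duplicate dashboard itself was deleted in the web UI):

# remove the stale provisioned alerting definition for the old "worker" role and reload grafana
rm /etc/grafana/provisioning/alerting/dashboard-WDopenqaw5-xen.yaml
systemctl restart grafana-server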

Actions #8

Updated by okurz over 1 year ago

  • Due date deleted (2023-05-05)
  • Status changed from Feedback to Resolved

Unsilenced three alerts about "NoData". No further alerts observed.
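
Expiring a silence again can likewise be done in the UI or through the Alertmanager API, e.g. something like the following (the silence ID is a placeholder):

curl -s -X DELETE https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silence/<silence-id> \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN"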

Actions #9

Updated by okurz over 1 year ago

  • Due date set to 2023-05-05
  • Status changed from Resolved to In Progress

There still seem to be alerts even though the old dashboard files have been deleted. I will look into what is stored in the database.

Actions #10

Updated by okurz over 1 year ago

I could not find anything related in /var/log/grafana/grafana.log

Upstream issue reports:

There is a recommendation to "delete those alerts from the alert table in the database". On monitor.qa.suse.de I called

sqlite3 -readonly /var/lib/grafana/grafana.db

and ran select * from alert; but did not see any results. select * from alert_instance where current_state != 'Normal'; shows:

rule_org_id|rule_uid|labels|labels_hash|current_state|current_state_since|last_eval_time|current_state_end|current_reason
1|qa_network_infra_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","qa_network_infra_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: QA network infrastructure Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","qa_network_infra_ping_time_alert_openqaw5-xen"],["type","worker"]]|13e37a8ad3270e1caddf0e47dde609f16b338bf8|NoData|1682214240|1682415660|1682415840|
1|openqa_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","openqa_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: OpenQA Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","openqa_ping_time_alert_openqaw5-xen"],["type","worker"]]|51e00bfb83a79c51989dbcc488f874804085d08a|NoData|1682214240|1682415660|1682415840|
1|too_many_minion_job_failures_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","too_many_minion_job_failures_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Too many Minion job failures alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A,C"],["rule_uid","too_many_minion_job_failures_alert_openqaw5-xen"],["type","worker"]]|aa04534bffe2d7804ea14896b59000d7fc1fcc1d|NoData|1682214240|1682415660|1682415840|
1|download_rate_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","download_rate_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Download rate alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","download_count,download_rate"],["rule_uid","download_rate_alert_openqaw5-xen"],["type","worker"]]|821812b7c3a050d2fbc3aff9842ef17dbed25ce1|NoData|1682214240|1682415660|1682415840|

Trying to narrow this down with select * from alert_instance where current_state != 'Normal' and rule_uid like '%openqaw5-xen' and labels like '%type","worker%'; gives:

rule_org_id|rule_uid|labels|labels_hash|current_state|current_state_since|last_eval_time|current_state_end|current_reason
1|qa_network_infra_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","qa_network_infra_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: QA network infrastructure Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","qa_network_infra_ping_time_alert_openqaw5-xen"],["type","worker"]]|13e37a8ad3270e1caddf0e47dde609f16b338bf8|NoData|1682214240|1682415960|1682416140|
1|openqa_ping_time_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","openqa_ping_time_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: OpenQA Ping time alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A"],["rule_uid","openqa_ping_time_alert_openqaw5-xen"],["type","worker"]]|51e00bfb83a79c51989dbcc488f874804085d08a|NoData|1682214240|1682415960|1682416140|
1|too_many_minion_job_failures_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","too_many_minion_job_failures_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Too many Minion job failures alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","A,C"],["rule_uid","too_many_minion_job_failures_alert_openqaw5-xen"],["type","worker"]]|aa04534bffe2d7804ea14896b59000d7fc1fcc1d|NoData|1682214240|1682415960|1682416140|
1|download_rate_alert_openqaw5-xen|[["__alert_rule_namespace_uid__","N4bPksB4k"],["__alert_rule_uid__","download_rate_alert_openqaw5-xen"],["__contacts__","\"osd-admins\""],["alertname","openqaw5-xen: Download rate alert"],["datasource_uid","000000001"],["grafana_folder","openQA"],["hostname","openqaw5-xen"],["ref_id","download_count,download_rate"],["rule_uid","download_rate_alert_openqaw5-xen"],["type","worker"]]|821812b7c3a050d2fbc3aff9842ef17dbed25ce1|NoData|1682214240|1682415960|1682416140|

I ensured that the backup of the grafana db is up-to-date:

$ ssh backup.qa.suse.de "ls -ltra /home/rsnapshot/alpha.0/openqa-monitor.qa.suse.de/var/lib/grafana/grafana.db"
-rw-r--r-- 1 brltty sgx 143151104 Apr 25 11:50 /home/rsnapshot/alpha.0/openqa-monitor.qa.suse.de/var/lib/grafana/grafana.db
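
As an additional precaution before editing the live database one could also keep a local copy next to it (not recorded in this ticket, just a suggestion):

cp -a /var/lib/grafana/grafana.db /var/lib/grafana/grafana.db.bak-poo127982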

Then mkittler told me we can follow https://gitlab.suse.de/openqa/salt-states-openqa#removing-stale-provisioned-alerts

I prepared an SQL delete statement to execute while grafana is down, followed by restarting grafana:

systemctl stop grafana-server && sqlite3 /var/lib/grafana/grafana.db 'delete from alert_rule where labels like "%openqaw5-xen%" and labels like "%type%worker%"; delete from alert_instance where current_state != "Normal" and rule_uid like "%openqaw5-xen" and labels like "%type%worker%"; delete from alert_rule_version where rule_uid like "%openqaw5-xen" and labels like "%type%worker%"; delete from provenance_type where record_key like "%openqaw5-xen";' && sync && systemctl start grafana-server

Executed that; looks ok so far.
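
A quick sanity check that the stale entries are really gone is to re-run the earlier query read-only and expect zero rows, e.g.:

sqlite3 -readonly /var/lib/grafana/grafana.db \
  "select count(*) from alert_instance where rule_uid like '%openqaw5-xen' and labels like '%type%worker%';"
# expected output: 0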

Actions #11

Updated by okurz over 1 year ago

  • Status changed from In Progress to Feedback

Will wait to see whether something shows up again.

Actions #12

Updated by okurz over 1 year ago

  • Due date deleted (2023-05-05)
  • Status changed from Feedback to Resolved

No further problems observed, confirmed by nsinger and mkittler.
