action #137522
closed
[alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
0%
Description
Observations
Fri, 06 Oct 2023 09:16:02 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1
It seems to be all host: sushil-linux-tw-kde
Suggestions
- Likely sushil just sends data over telegraf to our grafana instance. Prevent that!
- Investigate where the list of machines we check here is taken from
- Introduce an additional telegraf data tag to our salt-controlled machines and adjust grafana queries/alerts to match this tag
  - In queries/panels to only show "our" hosts
  - In the alerts (maybe? Do we want to provide alerts for others as well?)
  - In the notification channels to only receive mails for hosts we care about
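The tagging suggestion can be sketched as follows. This is a minimal illustration in Python, not the actual telegraf or grafana configuration: the tag name `managed_by`, its value `salt`, the host `example-worker` and the sample values are all hypothetical.

```python
# Hypothetical tag that only salt-controlled machines would set.
OUR_TAG = ("managed_by", "salt")


def is_ours(point):
    """Return True if the data point carries the tag set on "our" machines."""
    key, value = OUR_TAG
    return point.get("tags", {}).get(key) == value


def filter_ours(points):
    """Keep only data points that come from salt-controlled machines."""
    return [p for p in points if is_ours(p)]


# Sample data (made up): one tagged host and the untagged host from this ticket.
points = [
    {"host": "example-worker", "tags": {"managed_by": "salt"}, "inodes_used_percent": 42.0},
    {"host": "sushil-linux-tw-kde", "tags": {}, "inodes_used_percent": 97.0},
]

alert_candidates = filter_ours(points)
```

With such a filter in place, data pushed from untagged machines would still be stored but would no longer reach panels, alerts or notifications.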
Out of scope
- Confirm why it is allowed to push telegraf data from anywhere - should/can this be dropped?
- Is there going to be a lot of (big) data unaccounted for?
Rollback actions
- Remove pause for host=sushil-linux-tw-kde
Updated by okurz about 1 year ago
- Related to action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M added
Updated by okurz about 1 year ago
- Description updated (diff)
- Priority changed from Urgent to High
Added silence and rollback action
Updated by okurz about 1 year ago
- Subject changed from [alert] Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high
- Description updated (diff)
Updated by livdywan about 1 year ago
- Subject changed from [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz about 1 year ago
- Status changed from Workable to In Progress
I can confirm the initial assumption. Just for the record, the first data point is from 2023-10-06 08:45:40.000. Our monitoring/grafana/alerting/inodes.yaml takes any data from autogen.disk and acts upon it. There is currently no list of machines for this (and similar) alerts, so we probably want to make sure we tag "our" data.
Updated by jbaier_cz about 1 year ago
We can apply a global tag to the salted telegraf: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1008
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 1 year ago
Now, we are also using the tag inside the alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1009
Final step, remove the pause and see if that helped.
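For illustration, the difference between an alert query without and with the tag could look roughly like the sketch below. It only builds InfluxQL-style query strings in Python. The tag name `managed_by`, the value `salt` and the field `inodes_used` are assumptions, and the real query in inodes.yaml may look different.

```python
def build_inode_query(tag_key=None, tag_value=None):
    """Build an InfluxQL-style query string over the disk measurement.

    Without a tag, every host that sends data is matched.
    """
    query = 'SELECT last("inodes_used") FROM "autogen"."disk"'
    if tag_key is not None:
        # Restrict the query to data points carrying the (hypothetical) tag.
        query += f""" WHERE "{tag_key}" = '{tag_value}'"""
    return query + ' GROUP BY "host"'


# Before: any host pushing telegraf data can trigger the alert.
unfiltered = build_inode_query()
# After: only hosts carrying the tag are considered.
filtered = build_inode_query("managed_by", "salt")
```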
Updated by jbaier_cz about 1 year ago
- Due date deleted (2023-10-24)
- Status changed from In Progress to Resolved
Rollback actions done, no new alerts so far.