action #137522
closed
[alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
Added by tinita 8 months ago.
Updated 7 months ago.
Description
Observations
Fri, 06 Oct 2023 09:16:02 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1
All of the alerts seem to be for host: sushil-linux-tw-kde
Suggestions
- Sushil's host likely just sends data via telegraf to our grafana instance. Prevent that!
- Investigate where the list of machines we check here is taken from
- Introduce an additional telegraf data tag to our salt-controlled machines and adjust grafana queries/alerts to match this tag
  - In queries/panels to only show "our" hosts
  - In the alerts (maybe? Do we want to provide alerts for others as well?)
  - In the notification channels to only receive mails for hosts we care about
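The tag-based filtering suggested above could be sketched roughly as follows. This is only an illustration in Python, not the actual grafana or telegraf configuration: the tag name `managed_by` and its value `salt` are hypothetical (the ticket does not name a tag), and the sample data points and the 80% threshold are made up.

```python
# Hypothetical tag marking salt-controlled machines; the real tag name
# was not decided in this ticket.
OUR_TAG_KEY = "managed_by"
OUR_TAG_VALUE = "salt"


def is_our_host(point: dict) -> bool:
    """Return True if the data point carries the tag for 'our' machines."""
    return point.get("tags", {}).get(OUR_TAG_KEY) == OUR_TAG_VALUE


def hosts_to_alert(points: list, threshold: float = 80.0) -> list:
    """Return sorted host names that are ours and exceed the inode threshold."""
    return sorted({
        p["tags"]["host"]
        for p in points
        if is_our_host(p) and p["inodes_used_percent"] > threshold
    })


# Made-up sample data: the untagged host is ignored even though its
# inode utilization is the highest.
points = [
    {"tags": {"host": "openqaworker1", "managed_by": "salt"}, "inodes_used_percent": 91.0},
    {"tags": {"host": "sushil-linux-tw-kde"}, "inodes_used_percent": 97.0},
    {"tags": {"host": "example-worker", "managed_by": "salt"}, "inodes_used_percent": 12.0},
]
print(hosts_to_alert(points))
```

The same predicate could in principle be applied at each of the three levels listed above (panels, alerts, notification channels); which level is appropriate is left open here.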
Out of scope
- Confirm why it is allowed to push telegraf data from anywhere - should/can this be dropped?
- Is there going to be a lot of (big) data unaccounted for?
Rollback actions
- Remove pause for host=sushil-linux-tw-kde
- Description updated (diff)
- Target version set to Ready
- Priority changed from Normal to Urgent
- Related to action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M added
- Description updated (diff)
- Priority changed from Urgent to High
Added silence and rollback action
- Subject changed from [alert] Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high
- Description updated (diff)
- Subject changed from [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
I can confirm the initial assumption. For the record, the first data point is from 2023-10-06 08:45:40.000. Our monitoring/grafana/alerting/inodes.yaml takes any data from autogen.disk and acts on it. There is currently no list of machines for this (and similar) alerts, so we probably want to make sure we tag "our" data.
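To illustrate what restricting such an alert to tagged data might mean on the query side, here is a small Python sketch that builds an InfluxQL-style query string with and without a tag condition. The actual query in inodes.yaml is not shown in this ticket, so the selected fields, the time window and the tag `managed_by=salt` are all assumptions.

```python
from typing import Optional


def build_inode_query(tag_key: Optional[str] = None,
                      tag_value: Optional[str] = None) -> str:
    """Build an InfluxQL-style query on autogen.disk, optionally limited to a tag."""
    # Field names and time window are assumed for illustration only.
    base = 'SELECT "inodes_used", "inodes_total" FROM "autogen"."disk"'
    conditions = ["time > now() - 5m"]
    if tag_key is not None and tag_value is not None:
        # Escape single quotes so the value stays inside the string literal.
        escaped = tag_value.replace("'", "\\'")
        conditions.append(f"\"{tag_key}\" = '{escaped}'")
    return base + " WHERE " + " AND ".join(conditions)


# Without a tag condition every host that pushes data is matched.
print(build_inode_query())
# With the hypothetical tag only salt-controlled machines are matched.
print(build_inode_query("managed_by", "salt"))
```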
- Due date set to 2023-10-24
Setting due date based on mean cycle time of SUSE QE Tools
- Due date deleted (2023-10-24)
- Status changed from In Progress to Resolved
Rollback actions done, no new alerts so far.