action #137522

closed

[alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M

Added by tinita 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-10-06
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

Fri, 06 Oct 2023 09:16:02 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1
All of the alerts seem to be for host: sushil-linux-tw-kde

Suggestions

  • Likely sushil just sends data over telegraf to our grafana instance. Prevent that!
  • Investigate where the list of machines we check here is taken from
  • Introduce an additional telegraf data tag on our salt-controlled machines and adjust grafana queries/alerts to match this tag:
      • In queries/panels, to only show "our" hosts
      • In the alerts (maybe? Do we want to provide alerts for others as well?)
      • In the notification channels, to only receive mails for hosts we care about
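
The tag-based suggestion above could look roughly like the following. This is only a sketch under assumptions: the tag key `managed_by` and value `salt` are hypothetical (the ticket does not name a tag), the point layout is made up for illustration, and the filter clause assumes InfluxQL-style queries behind grafana.

```python
from typing import Iterable, List, Mapping

# Hypothetical tag; the ticket does not specify the real key or value.
OWNER_TAG_KEY = "managed_by"
OWNER_TAG_VALUE = "salt"


def is_our_host(tags: Mapping[str, str]) -> bool:
    """Return True if the tag set carries the (assumed) ownership tag."""
    return tags.get(OWNER_TAG_KEY) == OWNER_TAG_VALUE


def hosts_to_alert_on(points: Iterable[Mapping[str, Mapping[str, str]]]) -> List[str]:
    """Collect the sorted, unique host names of points that belong to us.

    Each point is assumed to look like {"tags": {"host": ..., ...}}.
    """
    hosts = set()
    for point in points:
        tags = point.get("tags", {})
        if is_our_host(tags) and "host" in tags:
            hosts.add(tags["host"])
    return sorted(hosts)


def tag_filter_clause() -> str:
    """Build an InfluxQL-style condition that could be added to a WHERE clause."""
    return f"\"{OWNER_TAG_KEY}\" = '{OWNER_TAG_VALUE}'"


if __name__ == "__main__":
    sample = [
        {"tags": {"host": "openqaworker1", "managed_by": "salt"}},
        {"tags": {"host": "sushil-linux-tw-kde"}},  # no ownership tag
    ]
    print(hosts_to_alert_on(sample))
    print(tag_filter_clause())
```

With a filter like this, a host that pushes telegraf data without the tag would simply drop out of panels, alerts and notifications, without having to be paused individually.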

Out of scope

  • Confirm why it is allowed to push telegraf data from anywhere - should/can this be dropped?
  • Is there going to be a lot of (big) data unaccounted for?

Rollback actions

  • Remove pause for host=sushil-linux-tw-kde

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M (Resolved, nicksinger, 2023-10-06 to 2023-10-21)
