action #137522

closed

[alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M

Added by tinita about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-10-06
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

Fri, 06 Oct 2023 09:16:02 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1
All alerts seem to be about host: sushil-linux-tw-kde

Suggestions

  • Likely sushil just sends data over telegraf to our grafana instance. Prevent that!
  • Investigate where the list of machines we check here is taken from
  • Introduce an additional telegraf data tag on our salt-controlled machines and adjust grafana queries/alerts to match this tag:
      • In queries/panels, to only show "our" hosts
      • In the alerts (maybe? Do we want to provide alerts for others as well?)
      • In the notification channels, to only receive mails for hosts we care about
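
The tagging suggestion could look roughly like the sketch below. It only builds a query string; nothing is sent anywhere. The tag key `managed_by` and value `salt` are hypothetical placeholders for whatever tag ends up on the salt-controlled machines. The measurement `disk` and the fields `inodes_used`/`inodes_total` are the ones telegraf's disk input writes, but the real panel and alert queries may well be shaped differently.

```python
def quote_influx_string(value: str) -> str:
    """Quote a value as an InfluxQL string literal (single quotes)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_inode_query(tag_key: str = "managed_by", tag_value: str = "salt") -> str:
    """Build an InfluxQL query for inode usage restricted to tagged hosts.

    tag_key and tag_value are hypothetical; they stand for the extra
    telegraf tag that only "our" machines would carry.
    """
    return (
        'SELECT last("inodes_used") AS "used", last("inodes_total") AS "total" '
        'FROM "autogen"."disk" '
        f'WHERE "{tag_key}" = {quote_influx_string(tag_value)} '
        "AND time > now() - 5m "
        'GROUP BY "host", "path"'
    )


if __name__ == "__main__":
    print(build_inode_query())
```

Hosts that push data without the tag simply do not match the WHERE clause, so they drop out of panels and alerts without needing a maintained list of machines.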

Out of scope

  • Confirm why it is allowed to push telegraf data from anywhere - should/can this be dropped?
  • Is there going to be a lot of (big) data unaccounted for?

Rollback actions

  • Remove pause for host=sushil-linux-tw-kde

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M | Resolved | nicksinger | 2023-10-06 | 2023-10-21

Actions #1

Updated by tinita about 1 year ago

  • Description updated (diff)
Actions #2

Updated by okurz about 1 year ago

  • Target version set to Ready
Actions #3

Updated by okurz about 1 year ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by okurz about 1 year ago

  • Related to action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M added
Actions #5

Updated by okurz about 1 year ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Added silence and rollback action

Actions #6

Updated by okurz about 1 year ago

  • Subject changed from [alert] Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high
  • Description updated (diff)
Actions #7

Updated by livdywan about 1 year ago

  • Subject changed from [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by jbaier_cz about 1 year ago

  • Assignee set to jbaier_cz
Actions #9

Updated by jbaier_cz about 1 year ago

  • Status changed from Workable to In Progress

I can confirm the initial assumption. Just for the record, the first data point is from 2023-10-06 08:45:40.000. Our monitoring/grafana/alerting/inodes.yaml takes any data from autogen.disk and acts upon it. There is currently no list of machines for this (and similar) alerts, so we probably want to make sure we tag "our" data.
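
To illustrate why the untagged data matters, here is a small in-memory sketch of the alert condition. It is not how Grafana evaluates inodes.yaml; the threshold, the tag `managed_by=salt` and the sample numbers are all made up for the example.

```python
from typing import Optional, Tuple


def hosts_over_threshold(
    points,
    threshold: float = 0.8,  # hypothetical inode usage limit
    required_tag: Optional[Tuple[str, str]] = None,
):
    """Return the sorted host names whose inode usage exceeds the threshold.

    With required_tag=None every data point counts, which mirrors an alert
    that acts on anything in autogen.disk. With a (key, value) pair only
    points carrying that tag are considered.
    """
    flagged = set()
    for point in points:
        if required_tag is not None:
            key, value = required_tag
            if point.get("tags", {}).get(key) != value:
                continue  # not one of "our" hosts
        total = point["inodes_total"]
        if total <= 0:
            continue  # avoid division by zero on odd filesystems
        if point["inodes_used"] / total > threshold:
            flagged.add(point["host"])
    return sorted(flagged)


# Made-up sample data: one foreign host without the tag, one tagged worker.
SAMPLE_POINTS = [
    {"host": "sushil-linux-tw-kde", "tags": {},
     "inodes_used": 95, "inodes_total": 100},
    {"host": "openqaworker1", "tags": {"managed_by": "salt"},
     "inodes_used": 10, "inodes_total": 100},
]

if __name__ == "__main__":
    print(hosts_over_threshold(SAMPLE_POINTS))
    print(hosts_over_threshold(SAMPLE_POINTS, required_tag=("managed_by", "salt")))
```

Without the tag filter the foreign host is flagged; with it, only tagged machines can trigger the alert.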

Actions #10

Updated by jbaier_cz about 1 year ago

Actions #11

Updated by openqa_review about 1 year ago

  • Due date set to 2023-10-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by jbaier_cz about 1 year ago

Now, we are also using the tag inside the alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1009

Final step: remove the pause and see if that helped.

Actions #13

Updated by jbaier_cz about 1 year ago

  • Due date deleted (2023-10-24)
  • Status changed from In Progress to Resolved

Rollback actions done, no new alerts so far.
