action #137522

closed

[alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M

Added by tinita 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-10-06
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

Fri, 06 Oct 2023 09:16:02 +0200
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1
It seems to be all host: sushil-linux-tw-kde

Suggestions

  • Likely sushil just sends data over telegraf to our grafana instance. Prevent that!
  • Investigate where the list of machines we check here is taken from
  • Introduce an additional telegraf data tag to our salt-controlled machines and adjust grafana queries/alerts to match this tag:
      • In queries/panels, to only show "our" hosts
      • In the alerts (maybe? Do we want to provide alerts for others as well?)
      • In the notification channels, to only receive mails for hosts we care about
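
The tagging suggestion can be sketched roughly as follows. This is only an illustration in Python, not the actual telegraf or grafana configuration: the tag name `managed_by`, its value `salt`, the threshold of 80% and the host `openqaworker2` are all hypothetical. The point is that only data carrying "our" tag is considered before any alert fires.

```python
from dataclasses import dataclass, field

# Hypothetical tag key/value; the tag actually used on salt-controlled
# machines may be named differently.
OWN_TAG_KEY = "managed_by"
OWN_TAG_VALUE = "salt"


@dataclass
class Point:
    """One simplified inode utilization data point as sent by telegraf."""
    host: str
    inodes_used_percent: float
    tags: dict = field(default_factory=dict)


def only_own_hosts(points):
    """Keep only data points that carry the tag set on "our" machines."""
    return [p for p in points if p.tags.get(OWN_TAG_KEY) == OWN_TAG_VALUE]


def hosts_to_alert(points, threshold=80.0):
    """Return the sorted names of tagged hosts above the (assumed) threshold."""
    return sorted(
        {p.host for p in only_own_hosts(points) if p.inodes_used_percent > threshold}
    )


points = [
    Point("openqaworker1", 91.0, {OWN_TAG_KEY: OWN_TAG_VALUE}),
    Point("sushil-linux-tw-kde", 97.0),  # pushes data, but carries no tag
    Point("openqaworker2", 40.0, {OWN_TAG_KEY: OWN_TAG_VALUE}),
]
print(hosts_to_alert(points))
```

With such a filter in place, the untagged host would still appear in the raw data but would no longer trigger alerts or mails for the tools team.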

Out of scope

  • Confirm why it is allowed to push telegraf data from anywhere - should/can this be dropped?
  • Is there going to be a lot of (big) data unaccounted for?

Rollback actions

  • Remove pause for host=sushil-linux-tw-kde

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M | Resolved | nicksinger | 2023-10-06 | 2023-10-21

Actions #1

Updated by tinita 7 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 7 months ago

  • Target version set to Ready
Actions #3

Updated by okurz 7 months ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by okurz 7 months ago

  • Related to action #137519: [alert] Failed systemd services - openqaworker1 - proc-sys-fs-binfmt_misc.mount, kernel modules already removed with old kernel still running size:M added
Actions #5

Updated by okurz 7 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

Added silence and rollback action

Actions #6

Updated by okurz 7 months ago

  • Subject changed from [alert] Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high
  • Description updated (diff)
Actions #7

Updated by livdywan 7 months ago

  • Subject changed from [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high to [alert] alerts about "host: sushil-linux-tw-kde" that tools team should not be notified about, e.g. Inode utilization inside the OSD infrastructure is too high size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by jbaier_cz 7 months ago

  • Assignee set to jbaier_cz
Actions #9

Updated by jbaier_cz 7 months ago

  • Status changed from Workable to In Progress

I can confirm the initial assumption. Just for the record, the first data point is from 2023-10-06 08:45:40.000. Our monitoring/grafana/alerting/inodes.yaml takes any data from autogen.disk and acts on it. There is currently no list of machines for this (and similar) alerts, so we probably want to make sure we tag "our" data.
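
To illustrate the difference between an unrestricted query and one limited to tagged data, here is a rough sketch that builds InfluxQL query strings in Python. It is not the content of inodes.yaml: the field `inodes_used`, the one-hour time window and the tag `managed_by` = `salt` are assumptions made for the example.

```python
def inode_query(tag_filter=None):
    """Build an InfluxQL query over autogen.disk, optionally restricted by tags.

    tag_filter maps tag names to required values; the names are hypothetical.
    """
    conditions = ["time > now() - 1h"]
    for key, value in (tag_filter or {}).items():
        # InfluxQL: identifiers in double quotes, string values in single quotes
        conditions.append('"{0}" = \'{1}\''.format(key, value))
    where = " AND ".join(conditions)
    return (
        'SELECT last("inodes_used") FROM "autogen"."disk" '
        "WHERE " + where + ' GROUP BY "host"'
    )


# Without a filter, every host that pushes disk data is matched
print(inode_query())
# With the (hypothetical) tag, only salt-controlled machines are matched
print(inode_query({"managed_by": "salt"}))
```

The first query matches any machine that sends data, which is how sushil-linux-tw-kde ended up in the alert. The second one only matches data points that carry the tag.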

Actions #10

Updated by jbaier_cz 7 months ago

Actions #11

Updated by openqa_review 7 months ago

  • Due date set to 2023-10-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by jbaier_cz 7 months ago

Now, we are also using the tag inside the alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1009

Final step, remove the pause and see if that helped.

Actions #13

Updated by jbaier_cz 7 months ago

  • Due date deleted (2023-10-24)
  • Status changed from In Progress to Resolved

Rollback actions done, no new alerts so far.
