action #131459

closed

[openQA][infra] OSD ran out of inodes without triggering a notification size:M

Added by nicksinger about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-06-27
Due date: 2023-07-15
% Done: 0%
Estimated time:

Description

Motivation

Today, 2023-06-27, OSD ran out of inodes on its root filesystem (/). This caused various availability issues.
We should mitigate this issue for the future and implement monitoring that warns us before we run out of inodes.

Acceptance criteria

  • AC1: We have an alert informing us if we run out of free inodes on important filesystems
  • AC2: Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems

Suggestions


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines (Resolved, kraih, 2023-06-27)

Actions #1

Updated by nicksinger about 1 year ago

  • Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Actions #2

Updated by kraih about 1 year ago

  • Tags set to reactive work
  • Subject changed from [openQA][infra] OSD ran out of inodes to [openQA][infra] OSD ran out of inodes without triggering a notification
  • Assignee set to nicksinger
  • Target version set to Ready
Actions #3

Updated by kraih about 1 year ago

Monitoring inodes in Grafana is a good next step from #131447.

Actions #4

Updated by okurz about 1 year ago

  • Tags changed from reactive work to reactive work, infra
Actions #5

Updated by kraih about 1 year ago

From Slack:

Nick Singer: SELECT mean("inodes_used") / mean("inodes_total") FROM "autogen"."disk" WHERE ("host"::tag = 'openqa' AND "path"::tag = '/') AND $timeFilter GROUP BY time($__interval) fill(null) should do the trick
Tina Müller: I couldn't get it to work with the group by, I had to delete that
Tina Müller: but this looks good: https://monitor.qa.suse.de/d/1pHb56Lnk/tina-s-dashboard?orgId=1&refresh=5m&viewPanel=26
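
For a quick local cross-check of the ratio the query above plots (used inodes divided by total inodes), something like the following Python sketch could be run on the host itself. It is only an illustration and not part of the monitoring setup: the function names are hypothetical, and it reads the values from os.statvfs instead of the "disk" measurement.

```python
import os
from typing import Optional


def inode_ratio(used: int, total: int) -> Optional[float]:
    """Return used/total, or None if the filesystem reports no inode total."""
    if total <= 0:
        # Some filesystems report 0 total inodes; a ratio is meaningless there.
        return None
    return used / total


def inode_usage(path: str = "/") -> Optional[float]:
    """Fraction of inodes in use on the filesystem containing `path`."""
    st = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number of free ones.
    return inode_ratio(st.f_files - st.f_ffree, st.f_files)


if __name__ == "__main__":
    ratio = inode_usage("/")
    if ratio is None:
        print("inode usage on /: not available")
    else:
        print(f"inode usage on /: {ratio:.1%}")
```

The output should roughly match the IUse% column of `df -i /`.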
Actions #6

Updated by kraih about 1 year ago

  • Assignee deleted (nicksinger)

Unassigned, since I'm not sure if Nick or Tina is currently working on this.

Actions #7

Updated by nicksinger about 1 year ago

  • Assignee set to nicksinger
Actions #8

Updated by nicksinger about 1 year ago

  • Status changed from New to In Progress
Actions #9

Updated by okurz about 1 year ago

  • Subject changed from [openQA][infra] OSD ran out of inodes without triggering a notification to [openQA][infra] OSD ran out of inodes without triggering a notification size:M
  • Description updated (diff)
Actions #10

Updated by openqa_review about 1 year ago

  • Due date set to 2023-07-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger about 1 year ago

Added the panels for worker, webui and generic hosts: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/904
I'm not sure what this MR will produce, but it is a good starting point to refine further. Panels also have to exist before an alert can be attached to them.

Actions #12

Updated by nicksinger about 1 year ago

Panels were reworked yesterday to display data properly and to be identical across all dashboards. I have now added an alert, which still needs to be salted: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?returnTo=%2Falerting%2Flist%3Fsearch%3Dinode

Actions #13

Updated by nicksinger about 1 year ago

  • Status changed from In Progress to Feedback

While at it, I renamed our existing provisioned alert definitions: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/912 and created a new YAML file for the unified inode alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/913

Actions #14

Updated by nicksinger about 1 year ago

  • Status changed from Feedback to Resolved

MR merged and alert shows up as "provisioned": https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view
AC2 was already fulfilled earlier when we discovered that running salt-commands in a while-loop needs extra precaution.
