action #131459

closed

[openQA][infra] OSD ran out of inodes without triggering a notification size:M

Added by nicksinger 10 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-06-27
Due date:
2023-07-15
% Done:

0%

Estimated time:

Description

Motivation

Today (2023-06-27), OSD ran out of inodes on its root filesystem (/), which caused various availability issues.
We should mitigate this issue going forward and implement monitoring that warns us before we run out of inodes.

Acceptance criteria

  • AC1: We have an alert informing us if we run out of free inodes on important filesystems
  • AC2: Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems
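
For AC2, one way to narrow down where inodes are being consumed is to count filesystem entries per top-level directory, similar in spirit to `du --inodes -x`. The following is only a rough sketch: the function name `entries_per_subdir` is hypothetical, hard links are counted once per name, and unreadable directories are silently skipped, so the numbers are approximations.

```python
import os
from collections import Counter


def entries_per_subdir(root):
    """Approximate the number of entries below each immediate subdirectory of root.

    Only directories on the same filesystem as root are counted, so other
    mounted filesystems do not distort the result.
    """
    root_dev = os.lstat(root).st_dev
    counts = Counter()
    with os.scandir(root) as it:
        children = [e.path for e in it if e.is_dir(follow_symlinks=False)]
    for child in children:
        try:
            if os.lstat(child).st_dev != root_dev:
                continue  # a different filesystem is mounted here
        except OSError:
            continue
        total = 1  # the subdirectory itself
        for dirpath, dirnames, filenames in os.walk(child):
            kept = []
            for name in dirnames:
                try:
                    if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev:
                        kept.append(name)
                except OSError:
                    pass  # entry vanished or is not accessible
            dirnames[:] = kept  # prune so os.walk stays on this filesystem
            total += len(kept) + len(filenames)
        counts[os.path.basename(child)] = total
    return counts


if __name__ == "__main__":
    # Print the ten subdirectories of / holding the most entries.
    for name, total in entries_per_subdir("/").most_common(10):
        print(f"{total:>10}  /{name}")
```

Running this as root against `/` lists the directories holding the most entries, which is usually enough to spot a process that creates many small files.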

Suggestions


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines (Resolved, kraih, 2023-06-27)

Actions
Actions #1

Updated by nicksinger 10 months ago

  • Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Actions #2

Updated by kraih 10 months ago

  • Tags set to reactive work
  • Subject changed from [openQA][infra] OSD ran out of inodes to [openQA][infra] OSD ran out of inodes without triggering a notification
  • Assignee set to nicksinger
  • Target version set to Ready
Actions #3

Updated by kraih 10 months ago

Monitoring inodes in Grafana is a good next step from #131447.

Actions #4

Updated by okurz 10 months ago

  • Tags changed from reactive work to reactive work, infra
Actions #5

Updated by kraih 10 months ago

From Slack:

Nick Singer: SELECT mean("inodes_used") / mean("inodes_total") FROM "autogen"."disk" WHERE ("host"::tag = 'openqa' AND "path"::tag = '/') AND $timeFilter GROUP BY time($__interval) fill(null) should do the trick
Tina Müller: I couldn't get it to work with the group by, I had to delete that
Tina Müller: but this looks good: https://monitor.qa.suse.de/d/1pHb56Lnk/tina-s-dashboard?orgId=1&refresh=5m&viewPanel=26
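
The ratio in the query above (used inodes divided by total inodes) can also be checked directly on a host. This is a minimal sketch using Python's `os.statvfs`. The function names and the 0.8 threshold are assumptions for illustration and are not taken from the actual Grafana alert.

```python
import os


def ratio_from_counts(total, free):
    """Return used/total inodes, or None if the filesystem reports no total."""
    if total == 0:
        # Some filesystems report zero total inodes; no ratio is available then.
        return None
    return (total - free) / total


def inode_usage_ratio(path="/"):
    """Return the fraction of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    return ratio_from_counts(st.f_files, st.f_ffree)


def inode_alert(ratio, threshold=0.8):
    """Hypothetical alert condition: True once usage reaches the threshold."""
    return ratio is not None and ratio >= threshold


if __name__ == "__main__":
    ratio = inode_usage_ratio("/")
    print("inode usage on /:", ratio, "alert:", inode_alert(ratio))
```

The same numbers are shown by `df -i`, so the script is mainly useful for cross-checking what the dashboard panel displays.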
Actions #6

Updated by kraih 10 months ago

  • Assignee deleted (nicksinger)

Unassigned, since I'm not sure whether Nick or Tina is currently working on this.

Actions #7

Updated by nicksinger 10 months ago

  • Assignee set to nicksinger
Actions #8

Updated by nicksinger 10 months ago

  • Status changed from New to In Progress
Actions #9

Updated by okurz 10 months ago

  • Subject changed from [openQA][infra] OSD ran out of inodes without triggering a notification to [openQA][infra] OSD ran out of inodes without triggering a notification size:M
  • Description updated (diff)
Actions #10

Updated by openqa_review 10 months ago

  • Due date set to 2023-07-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger 10 months ago

Added the panels for worker, webui and generic hosts: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/904
Not sure what this MR will produce but it is a good starting point to refine further. It is also required to create panels first before being able to attach an alert to them.

Actions #12

Updated by nicksinger 10 months ago

Panels were reworked yesterday to display data properly and to be the same for all dashboards. I have now added an alert, which still needs to be salted: https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?returnTo=%2Falerting%2Flist%3Fsearch%3Dinode

Actions #13

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Feedback

While at it I renamed our existing provisioned alert definitions: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/912 and created a new yaml-file for the unified inode alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/913

Actions #14

Updated by nicksinger 10 months ago

  • Status changed from Feedback to Resolved

MR merged and alert shows up as "provisioned": https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view
AC2 was already fulfilled earlier when we discovered that running salt-commands in a while-loop needs extra precaution.
