action #131459
closed [openQA][infra] OSD ran out of inodes without triggering a notification size:M
Description
Motivation
Today, 2023-06-27, OSD ran out of inodes on its root filesystem (/). This caused various availability issues.
We should mitigate this issue for the future and implement monitoring that warns us before we run out of inodes.
Acceptance criteria
- AC1: We have an alert informing us if we run out of free inodes on important filesystems
- AC2: Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems
Suggestions
- We already collect the relevant metric, see https://stats.openqa-monitor.qa.suse.de/explore?orgId=1&left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22host::tag%22,%22value%22:%22openqa%22,%22operator%22:%22%3D%22%7D,%7B%22key%22:%22path::tag%22,%22value%22:%22%2F%22,%22operator%22:%22%3D%22,%22condition%22:%22AND%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22inodes_free%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22disk%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D for an example
- https://docs.saltproject.io/en/latest/topics/jobs/job_cache.html mentions several options to adjust. Most of them apply to a time range, which might not help if a lot of jobs run (as happened here). However, mounting the /var/cache/salt directory on a tmpfs could help avoid bringing the whole system down
- Create the relevant monitoring panels for each of generic, worker and webui, plus alerts
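To see which directory is actually consuming inodes (for example under /var/cache/salt, as suspected above), a rough count of directory entries per subdirectory can help. The following is only a minimal sketch: the function name `count_entries` and the default path are assumptions for illustration, and the count is an approximation because hard links are counted once per name.

```python
import os
import sys
from collections import Counter


def count_entries(root):
    """Count files and directories found below each immediate child of root.

    Entries located directly in root are counted under the key ".".
    """
    counts = Counter()
    # os.walk does not follow symlinks by default and skips unreadable dirs
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(dirnames) + len(filenames)
    return counts


if __name__ == "__main__":
    # Default path is an assumption taken from the suggestion above
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/cache/salt"
    for name, n in count_entries(root).most_common(10):
        print(f"{n:>10}  {name}")
```

Run with a directory as the first argument to list the ten subdirectories holding the most entries.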
Updated by nicksinger over 1 year ago
- Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Updated by kraih over 1 year ago
- Tags set to reactive work
- Subject changed from [openQA][infra] OSD ran out of inodes to [openQA][infra] OSD ran out of inodes without triggering a notification
- Assignee set to nicksinger
- Target version set to Ready
Updated by kraih over 1 year ago
Monitoring inodes in Grafana is a good next step from #131447.
Updated by okurz over 1 year ago
- Tags changed from reactive work to reactive work, infra
Updated by kraih over 1 year ago
From Slack:
Nick Singer: SELECT mean("inodes_used") / mean("inodes_total") FROM "autogen"."disk" WHERE ("host"::tag = 'openqa' AND "path"::tag = '/') AND $timeFilter GROUP BY time($__interval) fill(null) should do the trick
Tina Müller: I couldn't get it to work with the group by, I had to delete that
Tina Müller: but this looks good: https://monitor.qa.suse.de/d/1pHb56Lnk/tina-s-dashboard?orgId=1&refresh=5m&viewPanel=26
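For a quick cross-check on a single host, the same used/total ratio can be computed locally with Python's `os.statvfs`. This is only a sketch and not what the dashboard runs: the 0.9 threshold and the function names are assumptions, and the locally computed value should roughly correspond to `inodes_used / inodes_total` from the query above.

```python
import os


def inode_usage(path="/"):
    """Return the fraction of inodes in use on the filesystem holding path.

    Returns None if the filesystem reports no inode total.
    """
    st = os.statvfs(path)
    total = st.f_files
    if total == 0:
        # Some filesystems report zero total inodes, so no ratio is available
        return None
    used = total - st.f_ffree
    return used / total


def check(path="/", threshold=0.9):
    """Return a one-line status; the threshold is a hypothetical value."""
    usage = inode_usage(path)
    if usage is None:
        return f"{path}: filesystem does not report inode counts"
    state = "ALERT" if usage >= threshold else "ok"
    return f"{path}: {usage:.1%} inodes used ({state})"


if __name__ == "__main__":
    print(check("/"))
```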
Updated by kraih over 1 year ago
- Assignee deleted (nicksinger)
Unassigned, since I'm not sure if Nick or Tina is currently working on this.
Updated by okurz over 1 year ago
- Subject changed from [openQA][infra] OSD ran out of inodes without triggering a notification to [openQA][infra] OSD ran out of inodes without triggering a notification size:M
- Description updated (diff)
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger over 1 year ago
Added the panels for worker, webui and generic hosts: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/904
Not sure what this MR will produce, but it is a good starting point to refine further. The panels also have to be created first before an alert can be attached to them.
Updated by nicksinger over 1 year ago
Panels were reworked yesterday to display data properly and to be the same for all dashboards. Now added an alert, https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?returnTo=%2Falerting%2Flist%3Fsearch%3Dinode , which still needs to be salted.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
While at it, I renamed our existing provisioned alert definitions: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/912 and created a new YAML file for the unified inode alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/913
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
MR merged and alert shows up as "provisioned": https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view
AC2 was already fulfilled earlier when we discovered that running salt commands in a while loop requires extra precautions.