action #131459
[openQA][infra] OSD ran out of inodes without triggering a notification size:M
Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-06-27
Due date: 2023-07-15
% Done: 0%
Estimated time:
Tags:
Description
Motivation
Today (2023-06-27) OSD ran out of inodes on its root filesystem (/), which caused various availability issues.
We should mitigate this issue for the future and implement monitoring so that we are warned before we run out of inodes.
Acceptance criteria
- AC1: We have an alert informing us if we run out of free inodes on important filesystems
- AC2: Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems
Suggestions
- We already collect the relevant metric, see https://stats.openqa-monitor.qa.suse.de/explore?orgId=1&left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22host::tag%22,%22value%22:%22openqa%22,%22operator%22:%22%3D%22%7D,%7B%22key%22:%22path::tag%22,%22value%22:%22%2F%22,%22operator%22:%22%3D%22,%22condition%22:%22AND%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22inodes_free%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22disk%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D for an example
- https://docs.saltproject.io/en/latest/topics/jobs/job_cache.html mentions several options to adjust. Most of them apply to a time range, which might not help if a lot of jobs run within a short period (as happened here). However, mounting the /var/cache/salt directory on a tmpfs could help, since a tmpfs has its own inode limit and filling it up would not bring the whole system down.
- Create the relevant monitoring panels and alerts for each of generic, worker and webui
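
As a local cross-check for the alert described in AC1, the following is a minimal sketch of how the share of free inodes on a filesystem could be computed and compared against a threshold. It is not the monitoring setup itself: the function names and the 10% threshold are assumptions chosen for illustration, and the real alert would presumably be built on the already collected `inodes_free` metric instead.

```python
import os


def percent_free(free, total):
    """Return the percentage of free inodes.

    Some filesystems report 0 total inodes (no fixed inode table);
    treat that as "nothing to alert on" and return 100.0.
    """
    if total == 0:
        return 100.0
    return 100.0 * free / total


def inode_usage(path="/"):
    """Return (free, total, percent_free) for the filesystem containing path."""
    st = os.statvfs(path)
    return st.f_ffree, st.f_files, percent_free(st.f_ffree, st.f_files)


def inode_alert(pct_free, threshold=10.0):
    """Return True if the free inode share is below the threshold.

    The default of 10% is a hypothetical value, not taken from the ticket.
    """
    return pct_free < threshold


if __name__ == "__main__":
    # "/" is the filesystem that ran full in this incident.
    free, total, pct = inode_usage("/")
    state = "ALERT" if inode_alert(pct) else "ok"
    print(f"/: {free} of {total} inodes free ({pct:.1f}%) -> {state}")
```

Running the script on OSD and on the workers would give a quick manual comparison against the values shown in the monitoring panels.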