action #131459

Updated by okurz over 1 year ago

## Motivation 
 On 2023-06-27, OSD ran out of inodes on its root filesystem (/), which caused various availability issues. 
 We should mitigate this issue for the future and implement monitoring that warns us before we run out of inodes. 

 ## Acceptance criteria 
 * **AC1:** We have an alert informing us if we run out of free inodes on important filesystems 
 * **AC2:** Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems 

 ## Suggestions 
 * We already collect the relevant metric, see https://stats.openqa-monitor.qa.suse.de/explore?orgId=1&left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22host::tag%22,%22value%22:%22openqa%22,%22operator%22:%22%3D%22%7D,%7B%22key%22:%22path::tag%22,%22value%22:%22%2F%22,%22operator%22:%22%3D%22,%22condition%22:%22AND%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22inodes_free%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22disk%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D for an example 
 * https://docs.saltproject.io/en/latest/topics/jobs/job_cache.html mentions several options to adjust. Most of them apply to a time range, which might not help if a lot of jobs run in a short period (as happened here). However, mounting the /var/cache/salt directory on a tmpfs could keep a full job cache from bringing the whole system down 
 * Create the relevant monitoring panels, plus alerts, for each of generic, worker and webui
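
To illustrate the kind of check an inode alert would encode, here is a minimal sketch in Python. It is not the Grafana alert itself: the function names and the 80%/90% thresholds are assumptions made for illustration, and it reads inode counts locally via `os.statvfs` rather than from the collected `inodes_free` metric.

```python
import os


def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems (e.g. btrfs) report no fixed inode count.
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def inode_alert(used_percent, warn_at=80.0, crit_at=90.0):
    """Map an inode usage percentage to an alert level.

    The thresholds are hypothetical defaults, not values from this ticket.
    """
    if used_percent >= crit_at:
        return "critical"
    if used_percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Check the root filesystem and the Salt cache directory if it exists.
    for path in ("/", "/var/cache/salt"):
        if os.path.exists(path):
            used = inode_usage_percent(path)
            print(f"{path}: {used:.1f}% inodes used -> {inode_alert(used)}")
```

Alerting on a usage percentage rather than an absolute free count lets the same rule apply to filesystems of different sizes, and a separate warning level leaves time to react before the filesystem is actually exhausted.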
