action #131459
closed [openQA][infra] OSD ran out of inodes without triggering a notification size:M
Description
Motivation
Today (2023-06-27) OSD ran out of inodes on its root filesystem (/). This caused various availability issues.
We should mitigate this issue for the future and implement monitoring that warns us before we run out of inodes.
Acceptance criteria
- AC1: We have an alert informing us if we run out of free inodes on important filesystems
- AC2: Possible offending processes filling up inodes rapidly are reconfigured to mitigate further problems
Suggestions
- We already collect the relevant metric, see https://stats.openqa-monitor.qa.suse.de/explore?orgId=1&left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22influxdb%22,%22uid%22:%22000000001%22%7D,%22resultFormat%22:%22time_series%22,%22orderByTime%22:%22ASC%22,%22tags%22:%5B%7B%22key%22:%22host::tag%22,%22value%22:%22openqa%22,%22operator%22:%22%3D%22%7D,%7B%22key%22:%22path::tag%22,%22value%22:%22%2F%22,%22operator%22:%22%3D%22,%22condition%22:%22AND%22%7D%5D,%22groupBy%22:%5B%7B%22type%22:%22time%22,%22params%22:%5B%22$__interval%22%5D%7D,%7B%22type%22:%22fill%22,%22params%22:%5B%22null%22%5D%7D%5D,%22select%22:%5B%5B%7B%22type%22:%22field%22,%22params%22:%5B%22inodes_free%22%5D%7D,%7B%22type%22:%22mean%22,%22params%22:%5B%5D%7D%5D%5D,%22policy%22:%22autogen%22,%22measurement%22:%22disk%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D for an example
- https://docs.saltproject.io/en/latest/topics/jobs/job_cache.html mentions several options to adjust. Most of them apply to a time range, which might not help if a lot of jobs run within a short period (as happened here). However, mounting the /var/cache/salt directory on a tmpfs could keep a full job cache from bringing the whole system down
- Create the relevant monitoring panels for each of the generic, worker and webui dashboards, plus alerts
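To check which directory is consuming inodes (for example the salt job cache mentioned above), counting directory entries is usually enough. The following is a minimal sketch, not something used in this ticket: the function names `count_entries` and `largest_subdirs` are made up for illustration, and the count only approximates inode usage because hard links are counted once per name.

```python
import os


def count_entries(root):
    """Count files and directories below root (roughly one inode each)."""
    total = 0
    # os.walk does not follow symlinked directories by default
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def largest_subdirs(root, top=5):
    """Return the `top` direct subdirectories of root with the most entries."""
    counts = {}
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                counts[entry.path] = count_entries(entry.path)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    # /var/cache/salt is the path named in the suggestion above
    for path, n in largest_subdirs("/var/cache/salt"):
        print(f"{n:>10}  {path}")
```

Running it against `/var/cache` or `/` instead would narrow down the offending directory step by step.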
Updated by nicksinger 10 months ago
- Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Updated by kraih 10 months ago
From Slack:
Nick Singer: SELECT mean("inodes_used") / mean("inodes_total") FROM "autogen"."disk" WHERE ("host"::tag = 'openqa' AND "path"::tag = '/') AND $timeFilter GROUP BY time($__interval) fill(null) should do the trick
Tina Müller: I couldn't get it to work with the group by, I had to delete that
Tina Müller: but this looks good: https://monitor.qa.suse.de/d/1pHb56Lnk/tina-s-dashboard?orgId=1&refresh=5m&viewPanel=26
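For reference, the ratio computed by the query above (used inodes divided by total inodes) can also be checked directly on a host. This is a minimal sketch using `os.statvfs`; the helper names and the 0.9 threshold are assumptions for illustration, not the values used in the actual alert.

```python
import os


def ratio_from_counts(total, free):
    """Return the used/total inode ratio, or 0.0 if no total is reported."""
    if total == 0:
        # Some filesystems report 0 total inodes; treat that as "no usage data"
        return 0.0
    return (total - free) / total


def inode_usage_ratio(path="/"):
    """Return the used/total inode ratio of the filesystem containing path."""
    st = os.statvfs(path)
    return ratio_from_counts(st.f_files, st.f_ffree)


def should_alert(ratio, threshold=0.9):
    """Hypothetical alert condition: usage at or above the threshold."""
    return ratio >= threshold


if __name__ == "__main__":
    ratio = inode_usage_ratio("/")
    print(f"inode usage on /: {ratio:.1%}, alert: {should_alert(ratio)}")
```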
Updated by openqa_review 10 months ago
- Due date set to 2023-07-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 10 months ago
Added the panels for worker, webui and generic hosts: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/904
Not sure what this MR will produce but it is a good starting point to refine further. It is also required to create panels first before being able to attach an alert to them.
Updated by nicksinger 10 months ago
Panels were reworked yesterday to properly display data and to be the same for all dashboards. I have now added an alert, https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?returnTo=%2Falerting%2Flist%3Fsearch%3Dinode, which still needs to be salted.
Updated by nicksinger 10 months ago
- Status changed from In Progress to Feedback
While at it I renamed our existing provisioned alert definitions: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/912 and created a new yaml-file for the unified inode alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/913
Updated by nicksinger 10 months ago
- Status changed from Feedback to Resolved
MR merged and alert shows up as "provisioned": https://stats.openqa-monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view
AC2 was already fulfilled earlier when we discovered that running salt-commands in a while-loop needs extra precaution.