action #131447
closed
Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
Description
Observation
Job in YaST Maintenance Update group does not run with the following error.
Happened on:
- openqaworker-arm-1:7
- [openqaworker-arm-3:5](https://openqa.suse.de/tests/11452070)
@okurz checked the workers and didn't find any memory shortages.
Also observed on openqaworker16, so this is not aarch64-specific.
Updated by rainerkoenig over 1 year ago
- Project changed from qe-yam to openQA Infrastructure (public)
Updated by okurz over 1 year ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Subject changed from aarch64-virtio: Jobs don't run becaues "no space left on device" to Some jobs incomplete due to "no space left on device" but enough space visible on machines
- Description updated (diff)
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by kraih over 1 year ago
It appears that #131249 was responsible for the problem. There were no more inodes left on /dev/vda1 because the salt job history was growing too quickly, and various services on OSD went down as a result. We have stopped salt-master for now and cleaned up its temporary files. That has stabilised the situation.
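To illustrate why "No space left on device" can show up while the disks still look mostly empty: a filesystem tracks free data blocks and free inodes separately, and running out of either makes file creation fail. The following is only a minimal sketch using Python's `os.statvfs`; the helper names `used_fraction` and `filesystem_report` are made up for this example and are not part of openQA or salt.

```python
import os


def used_fraction(total, free):
    """Fraction of a resource in use; 0.0 when the filesystem reports no total."""
    if total <= 0:
        return 0.0
    return (total - free) / total


def filesystem_report(path="/"):
    """Compare block usage and inode usage for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        # f_blocks/f_bfree count data blocks: roughly the "Use%" column of `df -h`.
        "blocks_used": used_fraction(st.f_blocks, st.f_bfree),
        # f_files/f_ffree count inodes: roughly the "IUse%" column of `df -i`.
        "inodes_used": used_fraction(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    report = filesystem_report("/")
    print(f"blocks used: {report['blocks_used']:.1%}")
    print(f"inodes used: {report['inodes_used']:.1%}")
```

A low `blocks_used` next to an `inodes_used` close to 100% would correspond to the situation described here. Some filesystems report zero total inodes, which this sketch treats as 0% used.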
Updated by kraih over 1 year ago
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Updated by kraih over 1 year ago
Some problems have been identified along the way:
1) The root filesystem on OSD is used for a lot of temporary files, from uploads to OBS rsync.
2) The inode limit on /dev/vda1 is very low.
3) Our templates for temporary files created by different openQA services are too generic and don't allow for individual services to be identified quickly.
4) There is no Grafana panel for monitoring the number of remaining inodes.
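As a rough illustration of the point about generic temporary file templates: if each service puts its own name into the template, leftovers can be attributed at a glance. The sketch below uses Python's `tempfile` purely for illustration; openQA itself is written in Perl, and the `openqa-<service>-` naming scheme and both function names are assumptions made up for this example, not the scheme adopted in the follow-up ticket.

```python
import os
import tempfile
from collections import Counter


def service_tempdir(service, base=None):
    """Create a temporary directory whose name starts with the owning service."""
    # Hypothetical naming scheme: "openqa-<service>-<random suffix>".
    return tempfile.mkdtemp(prefix=f"openqa-{service}-", dir=base)


def leftovers_by_service(base):
    """Count entries under `base` per service, based on the naming scheme above."""
    counts = Counter()
    for name in os.listdir(base):
        if name.startswith("openqa-"):
            # Drop the "openqa-" prefix and the random suffix added by mkdtemp.
            service = name[len("openqa-"):].rsplit("-", 1)[0]
            counts[service] += 1
    return counts
```

With such a scheme, a call like `leftovers_by_service("/tmp")` would show which service is leaving the most entries behind, instead of a long list of anonymous random names.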
Updated by nicksinger over 1 year ago
- Related to action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M added
Updated by kraih over 1 year ago
- Status changed from In Progress to Feedback
Updated by kraih over 1 year ago
- Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added
Updated by kraih over 1 year ago
- Related to action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M added
Updated by okurz over 1 year ago
- Subject changed from Some jobs incomplete due to "no space left on device" but enough space visible on machines to Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
Updated by okurz over 1 year ago
Handling jobs that have not yet been restarted: I updated the subject and am now calling export host=openqa.suse.de; ~/local/os-autoinst/scripts/openqa-monitor-investigation-candidates | ~/local/os-autoinst/scripts/openqa-label-known-issues-multi
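For readers unfamiliar with the auto_review convention: the quoted part of the subject is a regular expression that is matched against the reason of incomplete jobs, and the `:retry` suffix is understood here to request a retrigger of matching jobs. The sketch below only demonstrates what the expression matches, using Python's `re`; the labelling scripts may use a different regex engine, and the sample reason string is a made-up example, not a real log line from this incident.

```python
import re

# Regular expression copied from the ticket subject (the part between the quotes).
AUTO_REVIEW = re.compile(
    r"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285"
)


def matches_known_issue(reason):
    """Return True if a job's failure reason matches the auto_review expression."""
    return AUTO_REVIEW.search(reason) is not None


# Hypothetical failure reason, invented for illustration only.
sample = (
    "api failure: 400 response: could not write "
    "/tmp/hypothetical-upload/example.png: No space left on device "
    "at /hypothetical/path/Utils.pm line 285"
)

if __name__ == "__main__":
    print(matches_known_issue(sample))
```

Because every `.*` bridges arbitrary text on one line, the expression only fixes the order of the key fragments (HTTP 400, a PNG under /tmp, the error message, the source line), not the exact wording in between.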
Updated by okurz over 1 year ago
- Copied to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
Discussed in unblock 2023-06-28. All follow-up tasks are in separate tickets by now. The fallback has been handled and I know how to not deplete inodes again :)