action #131447
closed
Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
Description
Observation
A job in the YaST Maintenance Update group does not run, failing with the following error.
Happened on:
- openqaworker-arm-1:7
- [openqaworker-arm-3:5](https://openqa.suse.de/tests/11452070)
@okurz checked the workers and didn't find any memory shortages.
Also observed on openqaworker16, so this is not aarch64-specific.
Updated by rainerkoenig 10 months ago
- Project changed from qe-yam to openQA Infrastructure
Updated by okurz 10 months ago
- Project changed from openQA Infrastructure to openQA Project
- Subject changed from aarch64-virtio: Jobs don't run becaues "no space left on device" to Some jobs incomplete due to "no space left on device" but enough space visible on machines
- Description updated (diff)
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by kraih 10 months ago
It appears that #131249 was responsible for the problem. There were no more inodes left on /dev/vda1 because the salt job history was growing too quickly, and various services on OSD went down as a result. We have stopped salt-master for now and cleaned up its temporary files. That has stabilised the situation.
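The symptom in this ticket (writes failing with "No space left on device" while free space is still visible) is typical of inode exhaustion, because `df` without `-i` only reports block usage. The following is a minimal Python sketch, not part of openQA, that compares block and inode usage for a filesystem. The function names and the 90% threshold are illustrative assumptions.

```python
import os


def used_fraction(available, total):
    """Return the fraction in use, or None if the filesystem reports no total.

    Some filesystems do not report a fixed inode count, in which case
    the total is 0 and no meaningful fraction exists.
    """
    if total <= 0:
        return None
    return 1.0 - available / total


def usage(path="/"):
    """Return block and inode usage fractions for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        "blocks": used_fraction(st.f_bavail, st.f_blocks),
        "inodes": used_fraction(st.f_favail, st.f_files),
    }


def inodes_exhausted(path="/", threshold=0.9):
    """True if inode usage is at or above the (assumed) threshold."""
    inode_use = usage(path)["inodes"]
    return inode_use is not None and inode_use >= threshold
```

Calling `usage("/")` on an affected host would be expected to show a high `inodes` value next to a moderate `blocks` value, which matches what was seen on /dev/vda1.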
Updated by kraih 10 months ago
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Updated by kraih 10 months ago
Some problems have been identified along the way:
1) The root filesystem on OSD is used for a lot of temporary files, from uploads to OBS rsync.
2) The inode limit on /dev/vda1 is very low.
3) Our templates for temporary files created by different openQA services are too generic and don't allow for individual services to be identified quickly.
4) There is no Grafana panel for monitoring the number of remaining inodes.
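To illustrate point 3, here is a small Python sketch of service-specific temporary directory names. openQA itself is written in Perl, so this is only an illustration of the idea and not the project's implementation. The `openqa-<service>-` prefix scheme and the helper name are hypothetical.

```python
import os
import tempfile


def make_service_tmpdir(service, base=None):
    """Create a temporary directory whose name identifies the owning service.

    The prefix scheme "openqa-<service>-" is an assumption made for this
    example. If base is None, the system default temp location is used.
    """
    prefix = f"openqa-{service}-"
    return tempfile.mkdtemp(prefix=prefix, dir=base)


def leftover_dirs(service, base):
    """List directories in base that were created for the given service."""
    prefix = f"openqa-{service}-"
    return sorted(
        name
        for name in os.listdir(base)
        if name.startswith(prefix) and os.path.isdir(os.path.join(base, name))
    )
```

With a naming scheme like this, a listing of the temporary directory shows at a glance which service left files behind, and cleanup can be restricted to a single prefix.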
Updated by nicksinger 10 months ago
- Related to action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M added
Updated by kraih 10 months ago
- Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added
Updated by kraih 10 months ago
- Related to action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M added
Updated by okurz 10 months ago
- Subject changed from Some jobs incomplete due to "no space left on device" but enough space visible on machines to Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
Updated by okurz 10 months ago
- Copied to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added