action #131447

closed

Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines

Added by rainerkoenig 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-06-27
Due date:
% Done: 0%
Estimated time:
Description

Observation

Job in YaST Maintenance Update group does not run with the following error.

Happened on:

@okurz checked the workers and didn't find any memory shortages.

Also observed on openqaworker16, so this is not aarch64-specific.


Related issues 5 (2 open, 3 closed)

Related to openQA Infrastructure - action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M | Resolved | okurz | 2023-06-22
Related to openQA Infrastructure - action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M | Resolved | nicksinger | 2023-06-27 | 2023-07-15
Related to openQA Project - action #131465: Make temporary files and directories created by openQA services easier to identify size:M | Resolved | tinita | 2023-06-27 | 2023-07-13
Related to openQA Project - action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M | Workable | 2023-06-27
Copied to openQA Project - action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? | New

Actions #1

Updated by rainerkoenig 10 months ago

  • Project changed from qe-yam to openQA Infrastructure
Actions #2

Updated by okurz 10 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from aarch64-virtio: Jobs don't run because "no space left on device" to Some jobs incomplete due to "no space left on device" but enough space visible on machines
  • Description updated (diff)
  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #3

Updated by kraih 10 months ago

  • Assignee set to kraih
Actions #4

Updated by kraih 10 months ago

  • Status changed from New to In Progress
Actions #5

Updated by kraih 10 months ago

It appears that #131249 was responsible for the problem. There were no more inodes left on /dev/vda1 because the salt job history was growing too quickly, and various services on OSD went down as a result. We have stopped salt-master for now and cleaned up its temporary files. That has stabilised the situation.
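This explains why the machines showed enough free space: a filesystem can run out of inodes while free blocks remain, and file creation then fails with the same "No space left on device" error. On the command line, `df -i` shows inode usage. The following is a minimal Python sketch of the same check; the function name `inode_usage` is made up for illustration and is not part of openQA.

```python
import os


def inode_usage(path="/"):
    """Return (total, free, used_fraction) inode counts for the filesystem holding `path`.

    Some filesystems (btrfs, for example) report zero total inodes because
    they allocate them dynamically; used_fraction is 0.0 in that case.
    """
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes
    free = st.f_ffree    # number of free inodes
    used_fraction = 0.0 if total == 0 else (total - free) / total
    return total, free, used_fraction
```

Calling `inode_usage("/")` on the affected host and comparing `used_fraction` with the block usage reported by `df` would make this kind of mismatch visible.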

Actions #6

Updated by kraih 10 months ago

  • Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Actions #7

Updated by kraih 10 months ago

Some problems have been identified along the way:

  1) The root filesystem on OSD is used for a lot of temporary files, from uploads to OBS rsync.
  2) The inode limit on /dev/vda1 is very low.
  3) Our templates for temporary files created by different openQA services are too generic and don't allow for individual services to be identified quickly.
  4) There is no Grafana panel for monitoring the number of remaining inodes.
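As an illustration of point 3, a service-specific prefix in the temporary file template makes leftovers attributable at a glance. openQA itself is written in Perl, so this Python sketch only shows the idea; the helper names and any service names passed to them are hypothetical.

```python
import os
import tempfile


def make_service_tempdir(service, base=None):
    """Create a temporary directory whose name starts with the service name."""
    return tempfile.mkdtemp(prefix=f"{service}-", dir=base)


def leftover_entries(base, service):
    """List entries under `base` that carry the given service's prefix."""
    prefix = f"{service}-"
    return sorted(e for e in os.listdir(base) if e.startswith(prefix))
```

With such a scheme, a listing of /tmp shows which service created each entry, and cleanup can be targeted per service.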

Actions #8

Updated by nicksinger 10 months ago

  • Related to action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M added
Actions #9

Updated by kraih 10 months ago

  • Status changed from In Progress to Feedback

The immediate problem seems to have been resolved with the cleanup of salt temporary files. The salt setup will be fixed as part of #131249, and monitoring in Grafana will be set up in the follow-up ticket #131459.

Actions #10

Updated by kraih 10 months ago

  • Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added
Actions #11

Updated by kraih 10 months ago

  • Related to action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M added
Actions #12

Updated by okurz 10 months ago

  • Subject changed from Some jobs incomplete due to "no space left on device" but enough space visible on machines to Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
Actions #13

Updated by okurz 10 months ago

Handling jobs that have not yet been restarted: I updated the subject and am now calling:

export host=openqa.suse.de; ~/local/os-autoinst/scripts/openqa-monitor-investigation-candidates | ~/local/os-autoinst/scripts/openqa-label-known-issues-multi
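The updated subject carries an auto_review marker: a regular expression in quotes, followed by an optional :retry flag. The sketch below shows how such a marker could be extracted and matched against a job's failure text. It assumes plain regex search semantics and is not the logic of the labeling scripts themselves; the sample failure text in the usage note is invented, since the original error message is not included in this ticket.

```python
import re

# Subject of this ticket, copied verbatim.
SUBJECT = (
    'Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*'
    'No space left on device.*Utils.pm line 285":retry '
    "but enough space visible on machines"
)

# auto_review:"<regex>" with an optional :retry suffix.
MARKER = re.compile(r'auto_review:"(?P<pattern>.+?)"(?P<retry>:retry)?')


def parse_auto_review(subject):
    """Return (pattern, retry) from a ticket subject, or None if no marker is present."""
    m = MARKER.search(subject)
    if m is None:
        return None
    return m.group("pattern"), m.group("retry") is not None


def matches(subject, job_text):
    """True if the subject's auto_review pattern is found in job_text."""
    parsed = parse_auto_review(subject)
    if parsed is None:
        return False
    pattern, _retry = parsed
    return re.search(pattern, job_text) is not None
```

For example, `matches(SUBJECT, "api failure: 400 response: cannot write /tmp/example/screenshot.png: No space left on device at Utils.pm line 285")` is true for this made-up text, and `parse_auto_review(SUBJECT)` reports that the retry flag is set.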

Actions #14

Updated by okurz 10 months ago

  • Copied to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added
Actions #15

Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved

Discussed in unblock 2023-06-28. All follow-up tasks are in separate tickets by now. The fallback has been handled and I know how to not deplete inodes again :)
