action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines - openQA Project (public) - openSUSE Project Management Tool


action #131447

closed

Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines

Added by rainerkoenig almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee: kraih
Category: Regressions/Crashes
Target version: Ready
Start date: 2023-06-27
Due date:
% Done:
Estimated time:

Description

Observation

A job in the YaST Maintenance Update group fails to run with the following error.

Happened on:

openqaworker-arm-1:7
openqaworker-arm-3:5 (https://openqa.suse.de/tests/11452070)

@okurz checked the workers and didn't find any memory shortages.

Also observed on openqaworker16, so the problem is not aarch64-specific.

Files

Bildschirmfoto_2023-06-27_09-57-12.png (47.3 KB)

rainerkoenig, 2023-06-27 07:57

Related issues 5 (2 open — 3 closed)

Related to openQA Infrastructure (public) - action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M (Status: Resolved, Assignee: okurz, Start date: 2023-06-22)

Related to openQA Infrastructure (public) - action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M (Status: Resolved, Assignee: nicksinger, Start date: 2023-06-27, Due date: 2023-07-15)

Related to openQA Project (public) - action #131465: Make temporary files and directories created by openQA services easier to identify size:M (Status: Resolved, Assignee: tinita, Start date: 2023-06-27, Due date: 2023-07-13)

Related to openQA Project (public) - action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M (Status: Workable, Start date: 2023-06-27)

Copied to openQA Project (public) - action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? (Status: New)

Updated by rainerkoenig almost 2 years ago

Project changed from qe-yam to openQA Infrastructure (public)


Updated by okurz almost 2 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Subject changed from aarch64-virtio: Jobs don't run becaues "no space left on device" to Some jobs incomplete due to "no space left on device" but enough space visible on machines
Description updated (diff)
Category set to Regressions/Crashes
Target version set to Ready


Updated by kraih almost 2 years ago

Assignee set to kraih


Updated by kraih almost 2 years ago

Status changed from New to In Progress


Updated by kraih almost 2 years ago

It appears that #131249 was responsible for the problem. There were no more inodes left on /dev/vda1 because the salt job history was growing too quickly, and various services on OSD went down as a result. We have stopped salt-master for now and cleaned up its temporary files. That has stabilised the situation.
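The symptom in the subject ("No space left on device" while the disks still show free space) is typical of inode exhaustion: creating a file needs a free inode as well as free blocks, and the same error is reported when either runs out. A minimal sketch of how both can be checked together, using Python's standard `os.statvfs`; the function names and the choice of `/` as the path are illustrative assumptions, not something taken from the openQA code base:

```python
import os


def filesystem_usage(path="/"):
    """Return free space and free inodes for the filesystem containing `path`."""
    st = os.statvfs(path)
    return {
        # Bytes available to unprivileged users.
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        # Inodes available to unprivileged users.
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


def out_of_inodes(usage):
    """True if the filesystem reports an inode total and none of them are free.

    Some filesystems report a total of 0 inodes because they allocate them
    dynamically; those are never flagged here.
    """
    return usage["total_inodes"] > 0 and usage["free_inodes"] == 0


if __name__ == "__main__":
    usage = filesystem_usage("/")
    print(usage)
    if out_of_inodes(usage) and usage["free_bytes"] > 0:
        print("Free space is available, but no inodes are left.")
```

Looking only at the byte counts would have shown a healthy filesystem in the situation described above, which is why the inode counts are reported alongside them.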


Updated by kraih almost 2 years ago

Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added


Updated by kraih almost 2 years ago

Some problems have been identified along the way:
1) The root filesystem on OSD is used for a lot of temporary files, from uploads to OBS rsync.
2) The inode limit on /dev/vda1 is very low.
3) Our templates for temporary files created by different openQA services are too generic and don't allow for individual services to be identified quickly.
4) There is no Grafana panel for monitoring the number of remaining inodes.
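To illustrate the point about generic temporary file templates: if every service puts its own name into the template, a plain directory listing is enough to see which service is filling a directory. The sketch below uses Python's standard `tempfile` module; the `openqa-<service>-` naming scheme and both function names are hypothetical and are not the templates openQA actually uses.

```python
import collections
import os
import tempfile


def make_service_tempdir(service, base_dir=None):
    """Create a temporary directory whose name identifies the owning service.

    The "openqa-<service>-" prefix is a hypothetical naming scheme.
    """
    return tempfile.mkdtemp(prefix=f"openqa-{service}-", dir=base_dir)


def count_entries_by_prefix(directory):
    """Count directory entries grouped by the first two dash-separated name parts."""
    counts = collections.Counter()
    for name in os.listdir(directory):
        prefix = "-".join(name.split("-")[:2])
        counts[prefix] += 1
    return counts


if __name__ == "__main__":
    # Use a scratch directory so the example does not touch the real /tmp.
    with tempfile.TemporaryDirectory() as scratch:
        make_service_tempdir("worker", scratch)
        make_service_tempdir("worker", scratch)
        make_service_tempdir("webui", scratch)
        for prefix, count in count_entries_by_prefix(scratch).most_common():
            print(prefix, count)
```

With a generic template, all three directories would fall into one indistinguishable group; with the service name in the prefix, the per-service counts show directly who is consuming inodes.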


Updated by nicksinger almost 2 years ago

Related to action #131459: [openQA][infra] OSD ran out of inodes without triggering a notification size:M added


Updated by kraih almost 2 years ago

Status changed from In Progress to Feedback

The immediate problem seems to have been resolved with the cleanup of salt temporary files. The salt setup will be fixed as part of #131249, and monitoring in Grafana will be set up in the follow-up ticket #131459.


#10

Updated by kraih almost 2 years ago

Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added


#11

Updated by kraih almost 2 years ago

Related to action #131471: Leftover worker temporary directories in /tmp on OSD and O3 size:M added


#12

Updated by okurz almost 2 years ago

Subject changed from Some jobs incomplete due to "no space left on device" but enough space visible on machines to Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines
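The part of the new subject between the quotes after auto_review: is a regular expression that is meant to match the failure reason of affected jobs. A minimal sketch of what that expression matches, using Python's `re` module as a stand-in; how the labeling scripts actually apply the pattern is not shown in this ticket, and the sample reason string below is made up for illustration (the real message text may differ):

```python
import re

# Regular expression copied from the ticket subject.
AUTO_REVIEW_PATTERN = re.compile(
    r"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285"
)


def matches_known_issue(reason):
    """Return True if a job's failure reason matches the ticket's pattern."""
    return AUTO_REVIEW_PATTERN.search(reason) is not None


if __name__ == "__main__":
    # Hypothetical failure reason, invented for this example.
    sample = (
        "api failure: 400 response: cannot write /tmp/example.png: "
        "No space left on device at lib/Utils.pm line 285"
    )
    print(matches_known_issue(sample))
    print(matches_known_issue("api failure: 400 response: bad request"))
```

The `.*` parts let the pattern tolerate varying file names and surrounding text, while the fixed fragments keep it specific to this failure.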


#13

Updated by okurz almost 2 years ago

To handle jobs that have not yet been restarted, I updated the subject and am now running: export host=openqa.suse.de; ~/local/os-autoinst/scripts/openqa-monitor-investigation-candidates | ~/local/os-autoinst/scripts/openqa-label-known-issues-multi


#14

Updated by okurz almost 2 years ago

Copied to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added


#15

Updated by okurz almost 2 years ago

Status changed from Feedback to Resolved

Discussed in unblock 2023-06-28. All follow-up tasks are in separate tickets by now. The fallout has been handled and I know how to not deplete inodes again :)
