action #92338

[Alerting] File systems alert, / on osd

Added by okurz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-05-08
Due date: 2021-06-02
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

[Alerting] File systems alert
One of the file systems is too full

Metric name           Value
/: Used Percentage    90.049

See https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1

Output of du -x --max-depth=1 -BM | sort -n:

1M      ./lost+found
1M      ./mnt
1M      ./selinux
1M      ./storage
2M      ./bin
6M      ./sbin
12M     ./lib64
26M     ./etc
46M     ./root
99M     ./boot
147M    ./opt
1085M   ./lib
1284M   ./var
4128M   ./usr
10083M  ./tmp
16915M  .

It seems /tmp now makes a very large contribution. There are a lot of temporary directories like 6TyfduRNJ6, the oldest one from 2021-03-24 03:48. Unfortunately there are hardly any logs going back in time, likely because / is so full that the systemd journal does not retain more either. 2021-03-24 is not a date on which we commonly reboot the system automatically, so I am not sure whether non-openQA package upgrades caused a change.
I found some files like tmp.vvDweS5srZ which look like autoinst-log.txt or worker-log.txt. I assume that these are temporary files from openqa-investigate.
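
The search for old leftovers described above could be sketched roughly as follows. This is only an illustration, not a command taken from this ticket: the function name `list_stale_entries` and the default of 30 days are assumptions, and it relies on GNU `find` and `du`.

```shell
# Hypothetical helper: list top-level entries of a directory that were not
# modified within the last N days, with their disk usage in MiB, largest last.
list_stale_entries() {
    dir=${1:-/tmp}
    days=${2:-30}
    # -mindepth 1 skips the directory itself, -maxdepth 1 stays at top level.
    # -mtime +N matches entries last modified more than N days ago.
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" \
        -exec du -s -x -BM {} + | sort -n
}
```

For example, `list_stale_entries /tmp 30` prints one line per stale entry. Note that a directory's modification time only changes when entries are added to or removed from it, so this is a rough heuristic.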


Subtasks

QA - action #92341: fix potential leak of tempfiles of openqa-label-known-issues (Resolved, okurz)

openQA Project - action #92344: many temp-folders left over from live openQA jobs, regression? (Resolved, mkittler)

History

#1 Updated by okurz 3 months ago

  • Assignee set to okurz
  • Priority changed from Urgent to High

I deleted some directories and files on osd. Likely a similar problem can exist on osd. Maybe https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues#L134 combined with an unexpected exit of the script could be the problem: the tempfile is deleted, but only if the function exits successfully. I assume we should ensure that the file is deleted in an EXIT handler.
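
A minimal sketch of the EXIT handler idea, assuming a helper named `with_tempfile` (hypothetical, not the actual code of openqa-label-known-issues). The function body runs in a subshell, so the trap fires whenever that subshell ends, regardless of whether the wrapped command succeeded:

```shell
# Hypothetical helper: run a command with a fresh temporary file as its last
# argument and remove that file afterwards, even if the command fails.
with_tempfile() (
    # The parentheses make the function body a subshell, so the EXIT trap
    # below is scoped to this call and does not touch the caller's traps.
    tmp=$(mktemp) || exit 1
    trap 'rm -f "$tmp"' EXIT
    # The exit status of the wrapped command becomes the function's status.
    "$@" "$tmp"
)
```

Calling `with_tempfile some_command arg` runs `some_command arg <tempfile>`, where `some_command` stands for any command or shell function that writes its intermediate data to the given path.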

#2 Updated by okurz 3 months ago

  • Status changed from New to Feedback

#3 Updated by okurz 3 months ago

  • Status changed from Feedback to Blocked

I will track https://github.com/os-autoinst/scripts/pull/72 in #92341, after finding that a directory like /tmp/FOWvYnWzKt from 2021-03-24 05:45, the first non-empty directory, contains:

1616561148_719579.png  autoinst-log-live.txt  last.png  serial-terminal-live.txt

This looks more like a regression in openQA or some dependency. Created #92344.

I also deleted many temporary files and directories on osd, so we are back to 38% usage now, with 12G available.
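
A usage figure like the one above can be read from `df`. The following is a small sketch, with `fs_used_percent` being a hypothetical name and not part of any tooling referenced in this ticket:

```shell
# Hypothetical helper: print the used percentage of the file system that
# contains the given path (default /) as a plain integer.
fs_used_percent() {
    # df -P prints one header line and one data line per file system in the
    # POSIX format; the fifth column is the capacity, e.g. "38%".
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

The result could be compared against the alert threshold, for example `[ "$(fs_used_percent /)" -lt 90 ]`.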

#4 Updated by okurz 2 months ago

  • Status changed from Blocked to Feedback

#5 Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved

The MR has been merged and in effect for some days. The state of /tmp on osd looks fine.
