action #64824

closed

osd /results is at 99%, about to exceed available space

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 2020-03-25
Due date:
% Done: 0%
Estimated time:

Related issues 3 (1 open, 2 closed)

Related to openQA Project (public) - action #64574: Keep track of disk usage of results by job groups (Resolved, mkittler, 2020-03-18)
Related to openQA Infrastructure (public) - action #66922: osd: /results cleanup, see alert (Resolved, mkittler, 2020-05-17)
Copied to openQA Project (public) - action #64830: [ux][ui][easy][beginner] limit "keep_logs_in_days" to "keep_results_in_days" in webUI (Workable, 2020-03-25)

Actions #1

Updated by okurz over 4 years ago

We don't know (yet) how much of the used /results space each job or job group accounts for, but we can easily find outliers with long retention periods, which might have a bad impact:

select id,name,keep_logs_in_days,keep_results_in_days from job_groups where (keep_logs_in_days > 10 or keep_results_in_days > 10) order by keep_logs_in_days desc limit 10; 
 id  |               name                | keep_logs_in_days | keep_results_in_days 
-----+-----------------------------------+-------------------+----------------------
 167 | SLE 12 Security                   |               365 |                  365
 268 | Security                          |               365 |                  365
 222 | Migration : SLE15GA Milestone     |               300 |                  200
 111 | Migration : SLE15GA               |               300 |                  200
 198 | RT Acceptance: SLE 12 SP5         |               120 |                   90
 264 | Virtualization-Milestone          |                60 |                   70
 298 | WSL - 15.2                        |                60 |                   90
 263 | Virtualization-Acceptance         |                60 |                   70
  53 | Maintenance: SLE 12 SP2 Incidents |                60 |                   40
  41 | Maintenance: SLE 12 SP1 Incidents |                60 |                   90
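
Several of the rows above keep logs longer than results, which is the case #64830 wants to prevent in the webUI. As a rough sketch, such rows can be picked out of the psql output with a small filter. The function name logs_exceed_results is made up for illustration, and the filter assumes psql's default aligned output with the four columns in the order used by the query above.

```shell
#!/bin/sh
# Hypothetical helper: print job groups whose keep_logs_in_days exceeds
# keep_results_in_days. Reads psql's aligned, pipe-separated output on stdin.
# Assumes the column order id | name | keep_logs_in_days | keep_results_in_days.
logs_exceed_results() {
    awk -F'|' '
        # Only data rows have a purely numeric first column; the header
        # and the separator line are skipped by this condition.
        $1 ~ /^ *[0-9]+ *$/ && ($3 + 0) > ($4 + 0) {
            gsub(/^ +| +$/, "", $1)   # trim padding around the id
            gsub(/^ +| +$/, "", $2)   # trim padding around the name
            printf "%s\t%s\t%d\t%d\n", $1, $2, $3, $4
        }'
}
```

Piping the table above into logs_exceed_results should list groups 222, 111, 198 and 53, as those are the rows where the log period is larger than the result period.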
Actions #3

Updated by okurz over 4 years ago

  • Copied to action #64830: [ux][ui][easy][beginner] limit "keep_logs_in_days" to "keep_results_in_days" in webUI added
Actions #4

Updated by okurz over 4 years ago

Reduced the following settings (logs, results):

I am sorry if this causes any inconvenience. We simply cannot provide the necessary space and we need to take these urgent measures to prevent more dangerous data loss. Please also keep in mind that I did not change any periods for "important" results, e.g. the ones that are linked to a bug or to important, tagged builds.

Triggered result cleanup explicitly.

EDIT: ok, with my cleanup free space has already grown from 50GB to 72GB and the cleanup job is still running. I guess this should last through the night. … Or not, there is a new SLE15SP2 build and space is depleting fast again. As a drastic measure I ran openqa:/results # rm testresults/040[1-3][0-8]/*ltp*/video.ogv assuming that "more recent but not the latest ltp tests" do not need the video that much ;) This freed around another 30GB. Maybe this will last overnight.
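
Before running a glob-based rm like the one above, it can help to preview how many files it would hit and roughly how much space they occupy. The following is only a sketch: preview_video_cleanup is a hypothetical name, the sizes are allocated blocks as reported by du -k, and because the `*` in find -path also matches slashes the match can be slightly broader than the shell glob.

```shell
#!/bin/sh
# Hypothetical dry run: count the video.ogv files a cleanup glob would
# match and sum their allocated size. Nothing is deleted.
#   $1: testresults root directory
#   $2: glob for the job-id prefix directory, e.g. '040[1-3][0-8]'
#   $3: glob for the job directory name, e.g. '*ltp*'
preview_video_cleanup() {
    find "$1" -path "$1/$2/$3/video.ogv" -type f -exec du -k {} + |
        awk '{ kib += $1; n++ }
             END { printf "%d files, %d KiB\n", n, kib }'
}
```

For the command in this comment the call would be preview_video_cleanup testresults '040[1-3][0-8]' '*ltp*', run from /results.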

Actions #5

Updated by okurz over 4 years ago

The last cleanup brought the usage down to 96%, but over the following hours it grew again to 98%, so the situation is still critical. I have now triggered another results cleanup job manually. Using the queries from #64574 I identified https://openqa.suse.de/tests/4039300 as currently the biggest recorded job, with 629MB recorded size in the database. On osd, within /var/lib/openqa/testresults/04039/04039300-sle-15-SP2-Regression-on-Migration-from-SLE12-SP5-to-SLE15-SP2-x86_64-Build164.1-offline_sles12sp5_pscc_sdk-lp-we-asmm-contm-lgm-tcm-wsm_all_full@64bit, by far the biggest contributor seems to be video.ogv with 581M. The job runs for 5:12h, which is quite long. And we already know these candidates:

$ openqa-find-longest-running-test-modules https://openqa.suse.de/tests/4039300
92 s boot_to_desktop
93 s logs_from_installation_system
117 s welcome
145 s bootloader
183 s first_boot
187 s scc_registration
326 s check_package_version
1517 s install_service
4000 s await_install
4365 s patch_sle
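
To put the list above in relation to the total job runtime, the per-module seconds can be summed up. This is a minimal sketch, assuming the "<seconds> s <module>" line format shown above; module_share is a made-up helper name, not part of openqa-find-longest-running-test-modules.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-module durations and report their
# share of the total job runtime.
#   $1:    total job runtime in seconds (must be greater than 0)
#   stdin: lines of the form "<seconds> s <module>"
module_share() {
    awk -v total="$1" '
        $2 == "s" { sum += $1 }   # ignore lines that are not in the expected format
        END { printf "%d s of %d s (%.1f%%)\n", sum, total, 100 * sum / total }'
}
```

With the ten modules listed above and the 5:12h runtime (18720 s), this reports 11025 s, i.e. about 59% of the job, of which patch_sle and await_install alone contribute 8365 s.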

Created #64845 for the migration-specific task.

Actions #6

Updated by okurz over 4 years ago

  • Related to action #64574: Keep track of disk usage of results by job groups added
Actions #7

Updated by okurz over 4 years ago

coolo is in the process of deleting all old videos

Actions #8

Updated by okurz over 4 years ago

  • Status changed from In Progress to Resolved

Stephan Kulow (@coolo, 8:18): "the youngest video I deleted was from 3999999"

With this, /results is down to 71% usage, leaving a current headroom of 1.5TB again.
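
To notice earlier when the usage creeps back towards the critical range, a simple threshold check could be run periodically. This is only a sketch with hypothetical function names and an arbitrary example threshold. It relies on the POSIX output format of df -P and assumes the filesystem name contains no spaces.

```shell
#!/bin/sh
# Hypothetical helpers for a periodic disk usage check.

# Print the usage of the filesystem holding path $1 as a bare number.
usage_pct() {
    # With -P each filesystem is reported on one line; column 5 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 and print a warning if usage of $1 is at or above $2 percent,
# return 2 if the usage could not be determined.
check_usage() {
    pct=$(usage_pct "$1")
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is at ${pct}% (threshold ${2}%)"
        return 1
    fi
    echo "OK: $1 is at ${pct}%"
}
```

For example, check_usage /results 90 would warn well before the 98% to 99% range seen in this ticket.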

Actions #9

Updated by mkittler over 4 years ago

  • Related to action #66922: osd: /results cleanup, see alert added