action #64824: osd /results is at 99%, about to exceed available space - openQA Infrastructure (public) - openSUSE Project Management Tool

action #64824

closed

Added by okurz about 5 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

okurz

Category:

Target version:

Start date:

2020-03-25

Due date:

% Done:

Estimated time:

Related issues 3 (1 open — 2 closed)

Updated by okurz about 5 years ago

We don't know yet which jobs or job groups account for how much of the used /results space, but what we can easily find are outliers with long retention periods, which might have a bad impact:

select id,name,keep_logs_in_days,keep_results_in_days from job_groups where (keep_logs_in_days > 10 or keep_results_in_days > 10) order by keep_logs_in_days desc limit 10; 
 id  |               name                | keep_logs_in_days | keep_results_in_days 
-----+-----------------------------------+-------------------+----------------------
 167 | SLE 12 Security                   |               365 |                  365
 268 | Security                          |               365 |                  365
 222 | Migration : SLE15GA Milestone     |               300 |                  200
 111 | Migration ： SLE15GA              |               300 |                  200
 198 | RT Acceptance: SLE 12 SP5         |               120 |                   90
 264 | Virtualization-Milestone          |                60 |                   70
 298 | WSL - 15.2                        |                60 |                   90
 263 | Virtualization-Acceptance         |                60 |                   70
  53 | Maintenance: SLE 12 SP2 Incidents |                60 |                   40
  41 | Maintenance: SLE 12 SP1 Incidents |                60 |                   90
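
As a follow-up to the listing above, a minimal sketch that flags job groups whose log retention is longer than their result retention, the mismatch that #64830 is about. It assumes that logs kept longer than their results serve no purpose. The rows are copied from the query output; the in-memory SQLite table is only a stand-in for the real openQA database, and the function name is hypothetical.

```python
import sqlite3

# Rows copied from the query output above:
# (id, name, keep_logs_in_days, keep_results_in_days)
ROWS = [
    (167, "SLE 12 Security", 365, 365),
    (268, "Security", 365, 365),
    (222, "Migration : SLE15GA Milestone", 300, 200),
    (111, "Migration ： SLE15GA", 300, 200),
    (198, "RT Acceptance: SLE 12 SP5", 120, 90),
    (264, "Virtualization-Milestone", 60, 70),
    (298, "WSL - 15.2", 60, 90),
    (263, "Virtualization-Acceptance", 60, 70),
    (53, "Maintenance: SLE 12 SP2 Incidents", 60, 40),
    (41, "Maintenance: SLE 12 SP1 Incidents", 60, 90),
]


def groups_with_logs_outliving_results(rows):
    """Return job groups where keep_logs_in_days > keep_results_in_days."""
    # In-memory SQLite is a stand-in; the real data lives in the openQA database.
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table job_groups ("
        "id integer primary key, name text, "
        "keep_logs_in_days integer, keep_results_in_days integer)"
    )
    con.executemany("insert into job_groups values (?, ?, ?, ?)", rows)
    result = con.execute(
        "select id, name, keep_logs_in_days, keep_results_in_days "
        "from job_groups "
        "where keep_logs_in_days > keep_results_in_days "
        "order by keep_logs_in_days desc, id"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in groups_with_logs_outliving_results(ROWS):
        print(row)
```

With the ten rows above this selects the two Migration groups, RT Acceptance: SLE 12 SP5 and Maintenance: SLE 12 SP2 Incidents.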

Updated by okurz about 5 years ago

Copied to action #64830: [ux][ui][easy][beginner] limit "keep_logs_in_days" to "keep_results_in_days" in webUI added

Updated by okurz about 5 years ago

Reduced the following settings (logs, results):

SLE 12 Security: 365->30,365->200 (settings for "important" were actually lower, does not make sense to me)
Security: 365->30,365->200
Migration : SLE15GA Milestone: 300->30,200->180 (milestone builds should be "important" anyway)
Migration ： SLE15GA: 300->30,200->180
RT Acceptance: SLE 12 SP5: 120->30
Virtualization-Milestone: 60->30
WSL - 15.2: 60->30
Virtualization-Acceptance: 60->30
Maintenance: SLE 12 SP2 Incidents: 60->30
Maintenance: SLE 12 SP1 Incidents: 60->30

I am sorry if this causes any inconvenience. We simply cannot provide the necessary space, and we need to take these urgent measures to prevent more dangerous data loss. Please also keep in mind that I did not change any periods for "important" results, e.g. the ones that are linked to a bug or to important, tagged builds.

Triggered result cleanup explicitly.

EDIT: OK, with my cleanup the free space has already grown from 50GB to 72GB and the cleanup job is still running. I guess this should last through the night. … Or not: there is a new SLE15SP2 build and space is depleting fast again. I took a drastic measure with

openqa:/results # rm testresults/040[1-3][0-8]/*ltp*/video.ogv

assuming that "more recent but not the latest ltp tests" do not need the video that much ;) This freed around another 30GB. Maybe this will last overnight.
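
Before a destructive rm like the one above, a dry run can show which files the pattern matches and how much space they hold. The following is only a sketch of that idea, not what was actually run on osd; the function names are hypothetical, and the default pattern is the one from the rm command.

```python
import glob
import os


def video_candidates(results_dir,
                     pattern="testresults/040[1-3][0-8]/*ltp*/video.ogv"):
    """Return (path, size_in_bytes) for every file matching the pattern.

    Nothing is deleted; this only lists what an rm with the same
    pattern would remove when run from results_dir.
    """
    # Escape the base directory so only `pattern` is treated as a glob.
    full_pattern = os.path.join(glob.escape(results_dir), pattern)
    return [(p, os.path.getsize(p)) for p in sorted(glob.glob(full_pattern))]


def total_bytes(candidates):
    """Sum the sizes of all candidate files."""
    return sum(size for _, size in candidates)


if __name__ == "__main__":
    # "/results" is the working directory shown in the prompt above.
    found = video_candidates("/results")
    for path, size in found:
        print(f"{size:>12} {path}")
    print(f"{len(found)} files, {total_bytes(found) / 1024**3:.1f} GiB")
```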

Updated by okurz about 5 years ago

The last cleanup brought the usage down to 96%, but over the following hours it grew again to 98%, so the situation is still critical. I have now triggered another results cleanup job manually.

Using the queries from #64574 I identified https://openqa.suse.de/tests/4039300 as currently the biggest recorded job, with 629MB recorded size in the database. On osd, within /var/lib/openqa/testresults/04039/04039300-sle-15-SP2-Regression-on-Migration-from-SLE12-SP5-to-SLE15-SP2-x86_64-Build164.1-offline_sles12sp5_pscc_sdk-lp-we-asmm-contm-lgm-tcm-wsm_all_full@64bit, by far the biggest contributor seems to be video.ogv with 581M. The job runs for 5:12h, which is quite long. And we already know these candidates:

$ openqa-find-longest-running-test-modules https://openqa.suse.de/tests/4039300
92 s boot_to_desktop
93 s logs_from_installation_system
117 s welcome
145 s bootloader
183 s first_boot
187 s scc_registration
326 s check_package_version
1517 s install_service
4000 s await_install
4365 s patch_sle

Created #64845 for the migration-specific task.
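
To put the module durations above in relation to the 5:12h job runtime, here is a small sketch that parses the listing and sums it up. The listing is copied from the tool output above; the parsing assumes the "<seconds> s <module>" line format seen there, and the helper name is hypothetical.

```python
# Output of openqa-find-longest-running-test-modules, copied from above.
LISTING = """\
92 s boot_to_desktop
93 s logs_from_installation_system
117 s welcome
145 s bootloader
183 s first_boot
187 s scc_registration
326 s check_package_version
1517 s install_service
4000 s await_install
4365 s patch_sle
"""

# "5:12h" as stated for the job above, converted to seconds.
JOB_RUNTIME_S = 5 * 3600 + 12 * 60


def parse_durations(listing):
    """Parse '<seconds> s <module>' lines into {module: seconds}."""
    durations = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        seconds, unit, module = line.split(maxsplit=2)
        if unit != "s":
            raise ValueError(f"unexpected unit in line: {line!r}")
        durations[module] = int(seconds)
    return durations


durations = parse_durations(LISTING)
listed_total = sum(durations.values())
# The two longest-running modules and their combined duration.
top_two = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:2]
top_two_total = sum(seconds for _, seconds in top_two)

if __name__ == "__main__":
    print(f"listed modules: {listed_total} s of {JOB_RUNTIME_S} s job runtime")
    print(f"top two: {[name for name, _ in top_two]} = {top_two_total} s "
          f"({100 * top_two_total / JOB_RUNTIME_S:.1f}% of the job runtime)")
```

The listed modules add up to 11025 s, and patch_sle plus await_install alone take 8365 s, roughly 45% of the 5:12h runtime.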
