action #64824
closed
osd /results is at 99%, about to exceed available space
Added by okurz almost 5 years ago. Updated almost 5 years ago.
Updated by okurz almost 5 years ago
We don't know (yet) which jobs or job groups account for how much of the used /results space, but what we can easily discover are outliers that might have a bad impact:
select id,name,keep_logs_in_days,keep_results_in_days from job_groups where (keep_logs_in_days > 10 or keep_results_in_days > 10) order by keep_logs_in_days desc limit 10;
id | name | keep_logs_in_days | keep_results_in_days
-----+-----------------------------------+-------------------+----------------------
167 | SLE 12 Security | 365 | 365
268 | Security | 365 | 365
222 | Migration : SLE15GA Milestone | 300 | 200
111 | Migration : SLE15GA | 300 | 200
198 | RT Acceptance: SLE 12 SP5 | 120 | 90
264 | Virtualization-Milestone | 60 | 70
298 | WSL - 15.2 | 60 | 90
263 | Virtualization-Acceptance | 60 | 70
53 | Maintenance: SLE 12 SP2 Incidents | 60 | 40
41 | Maintenance: SLE 12 SP1 Incidents | 60 | 90
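Since a later comment notes that the retention for "important" results was in some cases lower than the regular retention, a similar query for the "important" retention columns could help cross-check the outliers. This is only a sketch; it assumes the columns are called keep_important_logs_in_days and keep_important_results_in_days and that the database is named openqa and reachable via psql as the postgres user:

# Hypothetical cross-check of the "important" retention settings for the same outliers.
sudo -u postgres psql openqa -c "SELECT id, name, keep_important_logs_in_days, keep_important_results_in_days FROM job_groups WHERE keep_logs_in_days > 10 OR keep_results_in_days > 10 ORDER BY keep_logs_in_days DESC LIMIT 10;"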
Updated by okurz almost 5 years ago
- Copied to action #64830: [ux][ui][easy][beginner] limit "keep_logs_in_days" to "keep_results_in_days" in webUI added
Updated by okurz almost 5 years ago
Reduced the following settings (logs, results):
- SLE 12 Security: 365->30, 365->200 (the settings for "important" results were actually lower, which does not make sense to me)
- Security: 365->30,365->200
- Migration : SLE15GA Milestone: 300->30,200->180 (milestone builds should be "important" anyway)
- Migration : SLE15GA: 300->30,200->180
- RT Acceptance: SLE 12 SP5: 120->30
- Virtualization-Milestone: 60->30
- WSL - 15.2: 60->30
- Virtualization-Acceptance: 60->30
- Maintenance: SLE 12 SP2 Incidents: 60->30
- Maintenance: SLE 12 SP1 Incidents: 60->30
I am sorry if this causes any inconvenience. We simply cannot provide the necessary space and need to take these urgent measures to prevent more serious data loss. Please also keep in mind that I did not change any retention periods for "important" results, e.g. the ones that are linked to a bug or to important, tagged builds.
Triggered result cleanup explicitly.
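For reference, a minimal sketch of how such a retention change could be applied directly on the database, assuming the same job_groups columns as in the query above and that the database is named openqa and reachable via psql as the postgres user (normally the job group settings in the webUI are the place to change this):

# Hypothetical example: cap logs at 30 days and results at 200 days for the
# "Security" group (id 268 taken from the query output above).
sudo -u postgres psql openqa -c "UPDATE job_groups SET keep_logs_in_days = 30, keep_results_in_days = 200 WHERE id = 268;"

The space is only actually freed once a result cleanup job runs, which is why the cleanup was triggered explicitly afterwards.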
EDIT: OK, with my cleanup the free space has already grown from 50GB to 72GB and the cleanup job is still running. I guess this should last through the night. … Or not: there is a new SLE15SP2 build and space is depleting fast again. I took a drastic measure with

openqa:/results # rm testresults/040[1-3][0-8]/*ltp*/video.ogv

assuming that "more recent but not the latest" LTP tests do not need the video that much ;) This freed around another 30GB. Maybe this will last over night.
Updated by okurz almost 5 years ago
The last cleanup brought the usage down to 96%, but over the following hours it grew again to 98%, so the situation is still critical. I have now triggered another results cleanup job manually. Using the queries from #64574 I have identified https://openqa.suse.de/tests/4039300 as the currently biggest recorded job, with 629MB recorded size in the database. Within osd, in /var/lib/openqa/testresults/04039/04039300-sle-15-SP2-Regression-on-Migration-from-SLE12-SP5-to-SLE15-SP2-x86_64-Build164.1-offline_sles12sp5_pscc_sdk-lp-we-asmm-contm-lgm-tcm-wsm_all_full@64bit, by far the biggest contributor seems to be video.ogv with 581M. The job runs for 5:12h, which is quite long. And we already know these candidates:
$ openqa-find-longest-running-test-modules https://openqa.suse.de/tests/4039300
92 s boot_to_desktop
93 s logs_from_installation_system
117 s welcome
145 s bootloader
183 s first_boot
187 s scc_registration
326 s check_package_version
1517 s install_service
4000 s await_install
4365 s patch_sle
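For reference, the kind of query from #64574 used to identify the biggest jobs might look roughly like this. It is only a sketch: it assumes the jobs table stores the recorded size in a result_size column (matching the "629MB recorded size in the database" above) and that the database is named openqa and reachable via psql as the postgres user:

# Hypothetical query: the ten jobs with the largest recorded result size.
sudo -u postgres psql openqa -c "SELECT id, pg_size_pretty(result_size) AS result_size FROM jobs ORDER BY result_size DESC NULLS LAST LIMIT 10;"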
Created #64845 for the migration-specific task.
Updated by okurz almost 5 years ago
- Related to action #64574: Keep track of disk usage of results by job groups added
Updated by okurz almost 5 years ago
coolo is in the process of deleting all old videos
Updated by okurz almost 5 years ago
- Status changed from In Progress to Resolved
Stephan Kulow @coolo 8:18: "the youngest video I deleted was from 3999999"
With this we are down to 71% usage on /results, leaving a current headroom of 1.5TB again.
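A sketch of how such a bulk deletion of old videos could look, assuming the /var/lib/openqa/testresults layout shown above and using the job id 3999999 mentioned by coolo as the cutoff (the actual commands used are not recorded here):

cd /var/lib/openqa/testresults
# Job directories are grouped by the zero-padded job id, so the top-level prefixes
# 00000 through 03999 cover all jobs with an id below 4000000.
# Preview the space that would be freed:
find 0[0-3]* -name video.ogv -print0 | du -ch --files0-from=- | tail -n 1
# Then delete the videos:
find 0[0-3]* -name video.ogv -delete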
Updated by mkittler over 4 years ago
- Related to action #66922: osd: /results cleanup, see alert added