action #55178

closed

** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is WARNING **

Added by okurz@suse.de over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
2019-07-30
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

---------- Forwarded Message ----------

Subject: ** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is
WARNING **
Date: Wednesday, 7 August 2019, 13.07.03 CEST
From: Monitoring User nagios@suse.de
To: Oliver Kurz okurz@suse.com

Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Wed Aug 7 11:07:03 UTC 2019
Info: WARN - 93.1% used (8.37 of 9.00 TB), trend: +72.92 GB / 24 hours

Service: fs_/var/lib/openqa

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fvar%2Flib%2Fopenqa
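Back-of-the-envelope, the headroom at the reported trend lasts only about a week (a sketch assuming the check reports decimal TB/GB; with binary units the result is similar):

echo "(9.00 - 8.37) * 1000 / 72.92" | bc -l   # remaining GB / daily growth, roughly 8.6 days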



Subtasks 5 (0 open, 5 closed)

action #54872: Clean old fixed assets from OSD (Resolved, mkittler, 2019-07-30)
openQA Tests - coordination #55556: [epic][osd][functional][u] osd: Untracked assets that most likely should be tracked (Resolved, okurz, 2019-08-15)
openQA Tests - action #55571: [osd][staging][sle] Untracked assets that most likely should be tracked: SLE staging (Resolved, andriinikitin, 2019-08-15)
openQA Tests - action #55574: [osd][sle][functional][y] Untracked assets that most likely should be tracked: s390x iso files that are not used (should be not even synced) (Resolved, riafarov, 2019-08-15)
action #55652: ** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is WARNING ** - Cleanup of results+logs (Resolved, okurz, 2019-08-16)

Actions #1

Updated by okurz over 4 years ago

  • Tracker changed from tickets to action
  • Project changed from openSUSE admin to openQA Infrastructure
  • Priority changed from Normal to Urgent
  • Private changed from Yes to No
Actions #2

Updated by okurz over 4 years ago

I did not yet acknowledge the monitoring alert as I fear it might then be ignored for too long.

Actions #3

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 4 years ago

As of today we are at 97%. Trying to handle it. Running du is no fun though :( Reduced some job group sizes via https://openqa.suse.de/admin/assets and triggered another cleanup. Down to 96% already.
Now https://openqa.suse.de/admin/assets lists "Assets by group (total 3803.24 GiB)" out of 8807 GiB currently used, which leaves 5004 GiB for other data, e.g. test results, thumbnails, logs. IMHO it makes sense for test results to take more space than the assets, e.g. for bug investigation. I realized that QAM is taking up a bigger proportion compared to other teams. I queried the database and provided some data in #55637, asking QAM to help.
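One way to get such a per-group breakdown directly from the PostgreSQL database is sketched below; it counts retained jobs per job group as a rough proxy for result/log usage and assumes the default "openqa" database and the "geekotest" database user. It is not necessarily the exact query used for #55637:

sudo -u geekotest psql openqa -c \
  "SELECT jg.name AS job_group, count(*) AS jobs
     FROM jobs j JOIN job_groups jg ON jg.id = j.group_id
    GROUP BY jg.name
    ORDER BY count(*) DESC
    LIMIT 20;"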

Actions #5

Updated by okurz over 4 years ago

  • Status changed from New to Workable
  • Assignee set to okurz
Actions #6

Updated by okurz over 4 years ago

  • Status changed from Workable to In Progress
Actions #7

Updated by okurz over 4 years ago

I am going over the job groups on https://openqa.suse.de/admin/assets# one by one, looking for groups that store multiple builds of the same assets and therefore have some redundancy I can reduce, e.g. "Development / Staging: SLE 15" with 120 GB. Reducing that to 100 GB, and another migration job group from 100 GB to 80 GB. Strangely it seems I can not directly edit the job group properties of e.g. https://openqa.suse.de/admin/job_templates/34, which is based on a YAML schedule, but I should still be able to edit job group properties. Reported as #55646
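For such groups the asset quota could in principle also be lowered via the job group API instead of the properties form; a minimal sketch, assuming a current openqa-cli with API credentials configured in client.conf and that size_limit_gb is the property behind the quota field (depending on the openQA version the PUT route may expect further group parameters):

openqa-cli api --osd -X PUT job_groups/34 size_limit_gb=100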

Actions #8

Updated by okurz over 4 years ago

I went over the complete list of all job groups again and reduced the job group quotas where feasible. At least this way I could reduce the usage to 91%, which is below the current monitoring warning threshold again.

Actions #9

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

Finished one run of "du":

openqa:/var/lib/openqa # sort -n du.log
0G      ./tmp
1G      ./.config
1G      ./.ssh
1G      ./db
1G      ./webui
10G     ./backup
1643G   ./images
2945G   ./testresults
3674G   ./share
8272G   .

With this I guess we can actually live for the time being and have enough time until we have finished an upgrade and can use more/bigger volumes.

To help with the challenge of finding out which part of our data takes up how much space, we split the data into individual partitions. They are a bit harder to manage, though.
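For reference, the du.log above can be produced with an invocation along these lines (a sketch, as the exact flags are not recorded in the ticket; -x keeps du on this filesystem, -BG prints per-directory totals rounded to GiB as in the output above):

openqa:/var/lib/openqa # du -x -BG --max-depth=1 . > du.log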

Actions #10

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

It seems coolo (and mkittler?) were more severe in constraining the job groups, deleting a lot more data. Also it seems we successfully tricked Engineering Infrastructure, as they can not easily reduce the size of the partitions we already have. Some subtickets are still open but in general I think we are good.

Actions #11

Updated by coolo over 4 years ago

Infra will come back to us once they have temporary room (you know Towers of Hanoi, no? :) - and the plan is still to reduce /results to 5 TB. But we still have 1.5 TB of headroom for that.
