action #55178

closed

** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is WARNING **

Added by okurz@suse.de over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
2019-07-30
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

---------- Forwarded Message ----------

Subject: ** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is
WARNING **
Date: Wednesday, 7 August 2019, 13.07.03 CEST
From: Monitoring User nagios@suse.de
To: Oliver Kurz okurz@suse.com

Notification: PROBLEM
Host: openqa.suse.de
State: WARNING
Date/Time: Wed Aug 7 11:07:03 UTC 2019
Info: WARN - 93.1% used (8.37 of 9.00 TB), trend: +72.92 GB / 24 hours

Service: fs_/var/lib/openqa

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=fs_%2Fvar%2Flib%2Fopenqa
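Back-of-the-envelope, the headroom at the reported trend lasts only about a week (a sketch assuming the check reports decimal TB/GB; with binary units the result is similar):

echo "(9.00 - 8.37) * 1000 / 72.92" | bc -l   # remaining GB / daily growth, roughly 8.6 days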



Subtasks 5 (0 open, 5 closed)

action #54872: Clean old fixed assets from OSD (Resolved, mkittler, 2019-07-30)
openQA Tests - coordination #55556: [epic][osd][functional][u] osd: Untracked assets that most likely should be tracked (Resolved, okurz, 2019-08-15)
openQA Tests - action #55571: [osd][staging][sle] Untracked assets that most likely should be tracked: SLE staging (Resolved, andriinikitin, 2019-08-15)
openQA Tests - action #55574: [osd][sle][functional][y] Untracked assets that most likely should be tracked: s390x iso files that are not used (should be not even synced) (Resolved, riafarov, 2019-08-15)
action #55652: ** PROBLEM Service Alert: openqa.suse.de/fs_/var/lib/openqa is WARNING ** - Cleanup of results+logs (Resolved, okurz, 2019-08-16)

Actions #1

Updated by okurz over 4 years ago

  • Tracker changed from tickets to action
  • Project changed from openSUSE admin to openQA Infrastructure
  • Priority changed from Normal to Urgent
  • Private changed from Yes to No
Actions #2

Updated by okurz over 4 years ago

I did not yet acknowledge the monitoring alert as I fear it might then be ignored for too long.

Actions #3

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 4 years ago

As of today we are at 97%. Trying to handle it. Running du is no fun though :( Reduced some job group sizes via https://openqa.suse.de/admin/assets and triggered another cleanup. Down to 96% already.
Now https://openqa.suse.de/admin/assets lists "Assets by group (total 3803.24 GiB)" out of 8807 GiB currently used, which leaves 5004 GiB for other data, e.g. test results, thumbnails, logs. IMHO it makes sense for test results to take more space than the assets, e.g. for bug investigation. I realized that QAM is taking up a bigger proportion compared to other teams. I queried the database and provided some data in #55637, asking QAM to help.
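One way to get such a per-group breakdown directly from the PostgreSQL database is sketched below; it counts retained jobs per job group as a rough proxy for result/log usage and assumes the default "openqa" database and the "geekotest" database user. It is not necessarily the exact query used for #55637:

sudo -u geekotest psql openqa -c \
  "SELECT jg.name AS job_group, count(*) AS jobs
     FROM jobs j JOIN job_groups jg ON jg.id = j.group_id
    GROUP BY jg.name
    ORDER BY count(*) DESC
    LIMIT 20;"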

Actions #5

Updated by okurz over 4 years ago

  • Status changed from New to Workable
  • Assignee set to okurz
Actions #6

Updated by okurz over 4 years ago

  • Status changed from Workable to In Progress
Actions #7

Updated by okurz over 4 years ago

I am going over the job groups on https://openqa.suse.de/admin/assets# one by one, looking for groups that store multiple builds of the same assets and therefore have some redundancy I can reduce, e.g. "Development / Staging: SLE 15" with 120 GB. Reducing that to 100 GB, and another migration job group from 100 GB to 80 GB. Strangely it seems I can not directly edit the job group properties of e.g. https://openqa.suse.de/admin/job_templates/34, which is based on a YAML schedule, but I should still be able to edit job group properties. Reported as #55646
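For such groups the asset quota could in principle also be lowered via the job group API instead of the properties form; a minimal sketch, assuming a current openqa-cli with API credentials configured in client.conf and that size_limit_gb is the property behind the quota field (depending on the openQA version the PUT route may expect further group parameters):

openqa-cli api --osd -X PUT job_groups/34 size_limit_gb=100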

Actions #8

Updated by okurz over 4 years ago

I went over the complete list of all job groups again and reduced the job group quotas where feasible. At least this way I could reduce the usage to 91%, which is below the current monitoring warning threshold again.

Actions #9

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

Finished one run of "du":

openqa:/var/lib/openqa # sort -n du.log
0G      ./tmp
1G      ./.config
1G      ./.ssh
1G      ./db
1G      ./webui
10G     ./backup
1643G   ./images
2945G   ./testresults
3674G   ./share
8272G   .

With this I guess we can actually live for the time being and have enough time until we have finished an upgrade and can use more/bigger volumes.

To help with the challenge of finding out which part of our data takes up how much space, we split the data into individual partitions. They are a bit harder to manage, though.
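For reference, the du.log above can be produced with an invocation along these lines (a sketch, as the exact flags are not recorded in the ticket; -x keeps du on this filesystem, -BG prints per-directory totals rounded to GiB as in the output above):

openqa:/var/lib/openqa # du -x -BG --max-depth=1 . > du.log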

Actions #10

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

It seems coolo (and mkittler?) were more severe in constraining the job groups, deleting a lot more data. Also it seems we successfully tricked Engineering Infrastructure, as they can not easily reduce the size of the partitions we already have. Some subtickets are still open but in general I think we are good.

Actions #11

Updated by coolo over 4 years ago

Infra will come back to us once they have temporary room (you know Towers of Hanoi, no? :) - and the plan is still to reduce /results to 5 TB. But we still have 1.5 TB of headroom for that.
