action #66922

osd: /results cleanup, see alert

Added by okurz about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
2020-05-17
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

Suggestions


Related issues

Related to openQA Project - action #67087: Allow to configure retention period for the video individually (New, 2020-05-20)

Related to openQA Infrastructure - action #64824: osd /results is at 99%, about to exceed available space (Resolved, 2020-03-25)

History

#1 Updated by okurz about 1 year ago

  • Assignee set to mkittler

@mkittler: inform QA SLE that important builds take up a lot of space, and make sure the "important" build tags on https://openqa.suse.de/parent_group_overview/15 are deleted for older candidates.

Ideas for new features: to save space in openQA, keep the dot of each job result in the database for longer but delete everything on the filesystem, and maybe even move the database content to a different, slower database; add a quota for results and logs to show that important builds take up more space.

@nsinger: ask SUSE IT whether we can have less flash storage but more rotating disk storage.

#2 Updated by mkittler about 1 year ago

I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.

If that is not enough, we need to reduce how long logs are kept. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as it did one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, which makes them candidates for a shorter log retention period.

And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)

#3 Updated by okurz about 1 year ago

mkittler wrote:

I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.

If that is not enough, we need to reduce how long logs are kept. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as it did one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, which makes them candidates for a shorter log retention period.

And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)

I suggest we focus on the immediate problem first by working on the groups you mentioned. The feature proposals should be followed up on after that and not be rushed.

Could you query some more details from the aforementioned job groups and apply manual mitigation, e.g. remove video files or whatever turns out to be the biggest offender? Ideally, also get in contact with the job group maintainers and suggest reducing the log retention periods, with the additional hint to make tests "more efficient" by not producing such heavy results. There are tickets for this and they already know about it, but they need to learn that a consequence of inefficient testing is that we can store fewer test results.

#4 Updated by pcervinka about 1 year ago

@yosun, could you please align with mkittler and improve the configuration of the "SLE 15 / File Systems" group on osd? Thank you.

#5 Updated by yosun about 1 year ago

I checked the filesystem job group: all logs are very small, in the range of KB or tens of KB.
The one exception is the video: those filesystem tests take a long time, which makes the video file large. I partly agree with Oliver's suggestion.
IMO, if technically possible, we could reduce only the video retention period for the filesystem job group.
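
A check like the one described above (videos versus everything else) can be sketched as follows. This is only a minimal illustration, not the procedure that was actually used: the helper name `size_by_kind` and the video file names `video.ogv` and `video.webm` are assumptions, and the directory to scan has to be supplied by the caller.

```python
import os

# Assumption: recorded videos are stored under one of these file names.
VIDEO_NAMES = {"video.ogv", "video.webm"}


def size_by_kind(root):
    """Sum file sizes below `root`, split into videos and all other files."""
    totals = {"video": 0, "other": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable (e.g. a cleanup is running).
                continue
            kind = "video" if name in VIDEO_NAMES else "other"
            totals[kind] += size
    return totals
```

Calling `size_by_kind` on the results directory of one job group would show whether the video really dominates before any retention period is changed.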

#6 Updated by mkittler about 1 year ago

  • Related to action #67087: Allow to configure retention period for the video individually added

#7 Updated by mkittler about 1 year ago

  • Related to action #64824: osd /results is at 99%, about to exceed available space added

#8 Updated by mkittler about 1 year ago

I've added #64824 as a reference for reducing retention periods.

#9 Updated by mkittler about 1 year ago

  • Status changed from Workable to Resolved
  • Target version deleted (Current Sprint)

I've deleted videos older than 1 month, which freed up enough disk space, so we're good again for the moment.


Here's a PR for automating the process if the disk usage gets critical again: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/307

This is basically a simple/stupid alternative to #67087. It is "stupid" in the sense that it doesn't distinguish between important and not important builds and it bypasses openQA's result size tracking. Bypassing openQA's result size tracking means that the "Approximate result size by job group" graph in our monitoring will not be aware of the removal and therefore might be quite inaccurate.
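
For illustration only, the idea of "delete old videos once disk usage gets critical" could look roughly like the sketch below. It is not the content of the merge request above. The function names, the 90% threshold, the 30-day default and the video file names `video.ogv` and `video.webm` are all assumptions. Like the approach described above, it treats every build the same and removes files behind openQA's back.

```python
import os
import shutil
import time

# Assumption: recorded videos are stored under one of these file names.
VIDEO_NAMES = {"video.ogv", "video.webm"}


def find_old_videos(root, max_age_days, now=None):
    """Return sorted paths of video files below `root` older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name not in VIDEO_NAMES:
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    old.append(path)
            except OSError:
                continue  # file disappeared in the meantime
    return sorted(old)


def cleanup_if_critical(root, max_age_days=30, threshold_percent=90.0, dry_run=True):
    """Delete old videos only if the filesystem holding `root` is nearly full.

    Returns the list of selected paths. With dry_run=True nothing is deleted.
    """
    usage = shutil.disk_usage(root)
    used_percent = 100.0 * usage.used / usage.total
    if used_percent < threshold_percent:
        return []
    victims = find_old_videos(root, max_age_days)
    if not dry_run:
        for path in victims:
            os.remove(path)
    return victims
```

Running it with `dry_run=True` first lists what would be removed without touching anything.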


Since scenarios with long execution times seem to be the biggest offenders, I created a PR to disable the video in those scenarios by default: https://github.com/os-autoinst/openQA/pull/3112


Since the immediate alert is resolved, I'm closing the ticket. Let's see how well the proposed PRs are accepted. We will likely still need to decrease retention periods in the future.
