action #66922: osd: /results cleanup, see alert - openQA Infrastructure (public) - openSUSE Project Management Tool


action #66922

closed

osd: /results cleanup, see alert

Added by okurz about 5 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

mkittler

Category:

Target version:

Start date:

2020-05-17

Due date:

% Done:

Estimated time:

Description

Acceptance criteria

AC1: alert on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=74&fullscreen&edit&tab=alert&refresh=30s has vanished

Suggestions

Look up the tickets from the last time we handled this
Decide based on https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=19 which job groups can be reduced
Inform users
Optionally ask for a space increase again

Related issues 2 (1 open — 1 closed)


Updated by okurz about 5 years ago

Assignee set to mkittler

@mkittler inform QA SLE that important builds take up a lot of space and make sure the "important" build tags for older candidates are deleted on https://openqa.suse.de/parent_group_overview/15

Ideas for new features: to save space in openQA, keep the data of job results in the database for longer but delete everything on the filesystem, and maybe even move database content to a slower, separate database; add a quota for results and logs to show that important builds take up more space

@nsinger ask SUSE IT if we can have less flash storage but more rotating disk storage


Updated by mkittler about 5 years ago

I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.

If that is not enough, we need to reduce the retention period for logs. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, which makes them candidates for a shorter log retention period.

And we could of course look into implementing the suggested feature, although not to resolve the immediate issue. (I'm not sure how long it would take to implement.)


Updated by okurz about 5 years ago

I suggest we focus on the immediate problem first by working on the groups you mentioned. The feature proposals should be followed up after that and not be rushed.

Could you query more details about the aforementioned job groups and apply manual mitigation, e.g. remove video files or whatever turns out to be the biggest offender? Also, ideally get in contact with the job group maintainers and suggest reducing the log retention periods, with the additional hint to make tests "more efficient" by not producing such heavy results. There are tickets for this and they already know about it, but they need to learn that a consequence of inefficient testing is that we can store fewer test results.
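To find the "biggest offender" in a job group's results, one option is to sum file sizes per file type. The following is only a rough sketch of that idea, not an openQA tool: the function names are hypothetical, and the directory you point it at (for example a test results directory on the openQA host) is an assumption about the local setup.

```python
import os
from collections import Counter
from pathlib import Path


def size_by_suffix(root):
    """Return a Counter mapping file suffix (e.g. '.ogv') to total bytes under root."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.lstat().st_size
            except OSError:
                continue  # file vanished, e.g. a cleanup is running concurrently
            # Files without a suffix are grouped under "(none)".
            totals[path.suffix.lower() or "(none)"] += size
    return totals


def report(root, top=5):
    """Print the file types that use the most space, largest first."""
    for suffix, total in size_by_suffix(root).most_common(top):
        print(f"{suffix:10} {total / 2**20:10.1f} MiB")
```

Calling `report(some_results_dir)` prints the file types with the largest total size, which gives a quick indication of whether videos or logs dominate.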


Updated by pcervinka about 5 years ago

@yosun could you please align with @mkittler and improve the configuration of the SLE 15 / File Systems group on osd? Thank you.


Updated by yosun about 5 years ago

I checked the File Systems job group: all logs are very small, in the range of KB or tens of KB.
The one exception is the video: those filesystem tests take a long time, which makes the video files big. I partly agree with Oliver's suggestion.
IMO, if technically possible, we could reduce only the video retention period for the File Systems job group.


Updated by mkittler about 5 years ago

Related to action #67087: Allow to configure retention period for the video individually added


Updated by mkittler about 5 years ago

Related to action #64824: osd /results is at 99%, about to exceed available space added


Updated by mkittler about 5 years ago

I've added #64824 as a reference for reducing retention periods.


Updated by mkittler about 5 years ago

Status changed from Workable to Resolved
Target version deleted (~~Current Sprint~~)

I've deleted videos older than one month, which freed up enough disk space, so we're good again for the moment.

Here's a PR for automating the process if the disk usage gets critical again: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/307
This is basically a simple/stupid alternative to #67087. It is "stupid" in the sense that it doesn't distinguish between important and not important builds and it bypasses openQA's result size tracking. Bypassing openQA's result size tracking means that the "Approximate result size by job group" graph in our monitoring will not be aware of the removal and therefore might be quite inaccurate.
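For illustration, the "delete old videos when disk usage gets critical" idea could look roughly like the sketch below. This is not the content of the merge request. The file name `video.ogv`, the 30-day age limit, the 85% usage threshold and the function names are all assumptions made for the example. Like the approach described above, it removes files behind openQA's back, so openQA's result size tracking would not notice the removal.

```python
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_old_videos(results_dir, max_age_days, video_name="video.ogv", now=None):
    """Yield video files under results_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for path in Path(results_dir).rglob(video_name):
        if path.is_symlink() or not path.is_file():
            continue  # only plain files are considered
        if path.stat().st_mtime < cutoff:
            yield path


def cleanup_videos(results_dir, max_age_days=30, usage_threshold=0.85, dry_run=True):
    """Delete old videos, but only if the filesystem usage reached the threshold.

    Returns the list of matched paths; with dry_run=True nothing is deleted.
    """
    usage = shutil.disk_usage(results_dir)
    if usage.used / usage.total < usage_threshold:
        return []  # enough free space, nothing to do
    # Collect the candidates first so nothing is deleted while still scanning.
    candidates = list(find_old_videos(results_dir, max_age_days))
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates
```

Running it with `dry_run=True` first shows which files would be removed before anything is actually deleted.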

Since scenarios with long execution times seem to be the biggest offenders, I created a PR to disable the video in those scenarios by default: https://github.com/os-autoinst/openQA/pull/3112

Since the immediate alert is resolved, I'm closing the ticket. Let's see how well the proposed PRs are accepted. We will likely still need to decrease retention periods in the future.
