action #66922: osd: /results cleanup, see alert - openQA Infrastructure (public) - openSUSE Project Management Tool

action #66922

closed

osd: /results cleanup, see alert

Added by okurz almost 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Start date: 2020-05-17

Description

Acceptance criteria

AC1: alert on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=74&fullscreen&edit&tab=alert&refresh=30s has vanished

Suggestions

Lookup tickets of last time we have handled this
Decide based on https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=19 which job group can be reduced
Inform users
Optionally ask for space increase again

Related issues 2 (1 open — 1 closed)

Related to openQA Project (public) - action #67087: Allow to configure retention period for the video individually (New, 2020-05-20)
Related to openQA Infrastructure (public) - action #64824: osd /results is at 99%, about to exceed available space (Resolved, okurz, 2020-03-25)

Updated by okurz almost 5 years ago

Assignee set to mkittler

@mkittler: inform QA SLE that important builds take up a lot of space, and make sure the "important" build tags on https://openqa.suse.de/parent_group_overview/15 are deleted for older candidates.

Ideas for new features: to save space in openQA, keep the data of job results in the database for longer but delete everything on the filesystem, and maybe even move the database content to a separate, slower database; add a quota for results and logs to show that important builds take up more space.

@nsinger ask SUSE IT if we can have less flash storage but more rotating disk storage

Updated by mkittler almost 5 years ago

I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.

If that is not enough, we need to reduce the duration for keeping logs. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, making them candidates for a shorter log storage duration.

And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)

Updated by okurz almost 5 years ago

mkittler wrote:

I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.

If that is not enough, we need to reduce the duration for keeping logs. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, making them candidates for a shorter log storage duration.

And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)

I suggest we focus on the immediate problem first by working on the groups you mentioned. The feature proposals should be followed up on after that and not be rushed.

Could you query some more details from the aforementioned job groups and apply manual mitigation against whatever is the biggest offender, e.g. remove video files? Ideally, also get in contact with the job group maintainers and suggest reducing the log retention periods, with the additional hint to make tests "more efficient" by not producing such heavy results. There are tickets for this and they already know about it, but they need to learn that a consequence of inefficient testing is that we can store fewer test results.

Updated by pcervinka almost 5 years ago

@yosun could you please align with @mkittler and improve the configuration of the SLE 15 / File Systems group on osd? Thank you.

Updated by yosun almost 5 years ago

I checked the filesystem job group: all logs are very small, on the order of KB or tens of KB.
The one exception is the video. Because the filesystem tests take a long time, the video files become big. I partly agree with Oliver's suggestion.
IMO, if technically possible, we could reduce only the video retention period for the filesystem job group.
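
The check described above (logs are tiny, videos are large) can be reproduced with a small script. The following is only a minimal sketch, not something used in this ticket: the function names are hypothetical, and it assumes the results of a job group can be reached as a plain directory tree in which videos are recognizable by their file suffix (for example ".ogv").

```python
from collections import Counter
from pathlib import Path


def size_by_suffix(results_dir):
    """Sum file sizes in bytes under results_dir, grouped by lowercase suffix."""
    totals = Counter()
    for path in Path(results_dir).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under "(none)".
            totals[path.suffix.lower() or "(none)"] += path.stat().st_size
    return totals


def report(totals, top=5):
    """Format the largest suffix groups as human-readable lines (sizes in MiB)."""
    return [
        f"{suffix:>8}  {size / 1024**2:10.1f} MiB"
        for suffix, size in totals.most_common(top)
    ]
```

Calling report(size_by_suffix(some_results_dir)) lists the suffixes that consume the most space first, which makes it easy to see whether videos really dominate before changing any retention period.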

Updated by mkittler almost 5 years ago

Related to action #67087: Allow to configure retention period for the video individually added

Updated by mkittler almost 5 years ago

Related to action #64824: osd /results is at 99%, about to exceed available space added

Updated by mkittler almost 5 years ago

I've added #64824 as a reference for reducing retention periods.

Updated by mkittler almost 5 years ago

Status changed from Workable to Resolved
Target version deleted (~~Current Sprint~~)

I've deleted videos older than 1 month, which freed up enough disk space, so we're good again for the moment.

Here's a PR for automating the process if the disk usage gets critical again: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/307

This is basically a simple/stupid alternative to #67087. It is "stupid" in the sense that it doesn't distinguish between important and not important builds and it bypasses openQA's result size tracking. Bypassing openQA's result size tracking means that the "Approximate result size by job group" graph in our monitoring will not be aware of the removal and therefore might be quite inaccurate.
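
To illustrate the idea, here is a minimal sketch of such a "simple/stupid" cleanup. It is not the content of the merge request above: the function names, the file name pattern "video.ogv", the 30-day age and the 90% threshold are all assumptions made for illustration. Like the approach described, it does not distinguish important builds and it deletes files behind openQA's back, so openQA's result size tracking would not notice the removal.

```python
import shutil
import time
from pathlib import Path


def disk_usage_percent(path):
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def find_old_videos(results_dir, max_age_days, patterns=("video.ogv",), now=None):
    """Yield files matching patterns that were last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for pattern in patterns:
        for path in Path(results_dir).rglob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                yield path


def cleanup_videos(results_dir, max_age_days=30, threshold_percent=90.0,
                   dry_run=True, usage_percent=None):
    """Delete old videos only if disk usage is at or above the threshold.

    Returns the paths that were removed (or would be removed in dry-run mode).
    usage_percent can be passed in explicitly; by default it is measured.
    """
    if usage_percent is None:
        usage_percent = disk_usage_percent(results_dir)
    if usage_percent < threshold_percent:
        return []
    # Collect candidates first so nothing is deleted while the tree is being walked.
    candidates = list(find_old_videos(results_dir, max_age_days))
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates
```

The default dry_run=True only reports what would be deleted, so the candidate list can be reviewed before anything is actually removed.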

Since scenarios with long execution times seem to be the biggest offenders I created a PR to disable the video in those scenarios by default: https://github.com/os-autoinst/openQA/pull/3112

Since the immediate alert is solved I'm closing the ticket. Let's see how well the proposed PRs are accepted. Likely we nevertheless need to decrease retention periods in the future.
