action #66922
closed
osd: /results cleanup, see alert
Added by okurz over 4 years ago. Updated over 4 years ago.
0%
Description
Acceptance criteria
- AC1: alert on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&panelId=74&fullscreen&edit&tab=alert&refresh=30s has vanished
Suggestions
- Look up the tickets from the last time we handled this
- Decide based on https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=19 which job group can be reduced
- Inform users
- Optionally ask for space increase again
Updated by okurz over 4 years ago
- Assignee set to mkittler
@mkittler inform QA SLE that important builds take up a lot of space and make sure the important-build tags on https://openqa.suse.de/parent_group_overview/15 are deleted for older candidates
Ideas for new features: to save space in openQA, keep the dot of job results in the database for longer but delete everything on the filesystem, maybe even move database content to a different, slower database; add a quota for results and logs to show that important builds take up more space
@nsinger ask SUSE IT if we can have less flash storage but more rotating disk storage
Updated by mkittler over 4 years ago
I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.
If that does not free enough, we need to reduce the duration for keeping logs. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as it did one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, making them candidates for a shorter log retention period.
And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)
Updated by okurz over 4 years ago
mkittler wrote:
I mentioned the problem in the chat and removed the tags. Let's see how much disk space this frees when the cleanup runs.
If that does not free enough, we need to reduce the duration for keeping logs. Judging by the graph, "Maintenance: Single Incidents/Maintenance: SLE 15 SP1 Incidents" uses as much disk space as it did one month ago. Besides that, the big groups "SLE 15/Migration", "SLE 15/Migration: Regression", "SLE 15/Functional" and "SLE 15/File Systems" grew a lot, making them candidates for a shorter log retention period.
And we could of course look into implementing the suggested feature although not to resolve the immediate issue. (Not sure how long it will take to implement the feature.)
I suggest we focus on the immediate problem first by working on the groups you mentioned. The feature proposals should be followed up on after that and not be rushed.
Could you query some more details from the aforementioned job groups and apply manual mitigation against whatever is the biggest offender, e.g. remove video files? Also, ideally get in contact with the job group maintainers and suggest reducing the log retention periods, with the additional hint to make tests "more efficient" by not producing such heavy results. There are tickets for this and they already know about it, but they need to learn that a consequence of inefficient testing is that we can store fewer test results.
Updated by yosun over 4 years ago
I checked the filesystem job group: all logs are very small, on the order of a few KB to tens of KB.
The one exception is video: those filesystem tests take a long time, which makes the video files big. I partly agree with Oliver's suggestion.
IMO, if technically possible, for the filesystem job group we could reduce only the video retention period.
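The kind of check described above (comparing how much space logs and videos take) could be sketched roughly as follows. This is only an illustration, not a tool used in this ticket: the function names `size_by_suffix` and `biggest_offenders` are hypothetical, and the directory passed in is whatever results directory one wants to inspect.

```python
from collections import defaultdict
from pathlib import Path


def size_by_suffix(root):
    """Return total bytes per file suffix below root (regular files only, symlinks skipped)."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            # Files without a suffix are grouped under "(none)".
            totals[path.suffix.lower() or "(none)"] += path.stat().st_size
    return dict(totals)


def biggest_offenders(root, top=5):
    """Return the `top` suffixes sorted by total size, largest first."""
    return sorted(size_by_suffix(root).items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling `biggest_offenders("/path/to/results")` on a job group's result directory (the path here is a placeholder) would list the file types that dominate disk usage.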
Updated by mkittler over 4 years ago
- Related to action #67087: Allow to configure retention period for the video individually added
Updated by mkittler over 4 years ago
- Related to action #64824: osd /results is at 99%, about to exceed available space added
Updated by mkittler over 4 years ago
I've added #64824 as a reference for reducing retention periods.
Updated by mkittler over 4 years ago
- Status changed from Workable to Resolved
- Target version deleted (Current Sprint)
I've deleted videos older than 1 month, which freed up enough disk space, so we're good again for the moment.
Here's a PR for automating the process if the disk usage gets critical again: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/307
This is basically a simple/stupid alternative to #67087. It is "stupid" in the sense that it doesn't distinguish between important and not important builds and it bypasses openQA's result size tracking. Bypassing openQA's result size tracking means that the "Approximate result size by job group" graph in our monitoring will not be aware of the removal and therefore might be quite inaccurate.
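For illustration only, an age-based video cleanup that runs when disk usage gets critical could look roughly like the sketch below. It is not the content of the merge request above. The file name `video.ogv`, the 90% threshold, the 30-day age and all function names are assumptions made for this example.

```python
import shutil
import time
from pathlib import Path


def old_videos(root, max_age_days=30, pattern="video.ogv", now=None):
    """List files matching `pattern` below root whose mtime is older than max_age_days.

    The file name "video.ogv" is an assumption for this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def usage_percent(path):
    """Return the used share (in percent) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cleanup_if_critical(root, threshold_percent=90.0, max_age_days=30, dry_run=True):
    """Delete old videos only if disk usage is at or above the threshold.

    Returns the list of affected paths; with dry_run=True nothing is deleted.
    """
    if usage_percent(root) < threshold_percent:
        return []
    victims = old_videos(root, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Like the approach described above, such a script removes files behind openQA's back, so it makes no distinction between important and other builds and the result size tracking would not see the removal. Defaulting to `dry_run=True` lets one review the list before anything is deleted.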
Since scenarios with long execution times seem to be the biggest offenders I created a PR to disable the video in those scenarios by default: https://github.com/os-autoinst/openQA/pull/3112
Since the immediate alert is resolved I'm closing the ticket. Let's see how well the proposed PRs are accepted. We will likely still need to decrease retention periods in the future.