action #76822: Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents) - openQA Infrastructure (public) - openSUSE Project Management Tool


action #76822

closed

Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2020-10-30
Due date: 2020-11-13
% Done:
Estimated time:
Tags: osd, storage, results, job group settings

Description

Observation

See https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png for the graph. There seems to be a very sudden increase in results for the job group "Maintenance: Test Repo/Maintenance: SLE 15 SP2 Updates". I wonder whether someone changed the result settings or whether many recent results have simply accumulated. I will just monitor for now :)

EDIT: On 2020-11-04 we saw an email alert from Grafana for /results.

Acceptance criteria

AC1: /results usage is again well below the alarm threshold, leaving headroom for at least some weeks

Suggestions

Review the trend of individual job groups
Reduce result retention periods after coordinating with job group stakeholders or owners

Related issues 1 (1 open — 0 closed)


Updated by okurz over 4 years ago

Tags set to storage, results, osd, job group settings
Subject changed from sudden increase in job group results for SLE 15 SP2 Incidents to Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
Description updated (diff)
Status changed from Feedback to Workable
Assignee deleted (~~okurz~~)
Priority changed from Normal to Urgent

By now an email alert has confirmed the problem:

[osd-admins] [Alerting] File systems alert
From:	Grafana <osd-admins@suse.de>
To:	osd-admins@suse.de
Sender:	osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id:	<osd-admins.suse.de>
Date:	03/11/2020 20.20

*/[Alerting] File systems alert/* 
results: Used Percentage 
90.008

Checking right now, we are back to 85% after a sudden decrease visible in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1604430390434&to=1604494503308 , most likely just the regular results cleanup.

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1601949566797&to=1604500482068 shows an increase over the last 30 days from 70% to the point where usage crossed the 90% alert threshold.
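To get a rough feeling for how much headroom these numbers leave, the growth can be extrapolated linearly. This is only a back-of-the-envelope sketch: the function name `days_until_threshold` is made up for illustration, and it assumes usage keeps growing at the average rate of the last 30 days, which the regular cleanup runs will not strictly follow.

```shell
#!/bin/sh
# Hypothetical helper, not part of any openQA tooling.
# days_until_threshold START_PCT END_PCT DAYS CURRENT_PCT THRESHOLD_PCT
# Estimates the days until THRESHOLD_PCT is reached, assuming the average
# growth observed between START_PCT and END_PCT over DAYS simply continues.
days_until_threshold() {
    awk -v s="$1" -v e="$2" -v d="$3" -v c="$4" -v t="$5" 'BEGIN {
        rate = (e - s) / d                  # percentage points per day
        if (rate <= 0) { print "inf"; exit }
        printf "%.1f\n", (t - c) / rate
    }'
}

# Numbers from this ticket: 70% -> 90% within 30 days, currently 85%, alert at 90%.
days_until_threshold 70 90 30 85 90   # prints 7.5
```

With roughly 0.67 percentage points of growth per day, 85% would leave only about a week before the next alert under this assumption.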


Updated by okurz over 4 years ago

Status changed from Workable to In Progress
Assignee set to okurz

A discussion is ongoing about what could have caused this; maybe it was someone's personal bug investigation. I will see whether this leads to a conclusion and a resolution. I also prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/396 so that we remove lengthy videos before any alert is triggered. Since the check calls "df" first, the cost of running it much more often is negligible, so I extended the same merge request to cover that as well.


Updated by okurz over 4 years ago

Status changed from In Progress to Feedback

I have now explicitly run the following command, and we are down to 81%:

[ "$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')" -ge 84 ] && time find /results/testresults -type f -iname '*.ogv' -mtime +28 -delete

That is certainly enough headroom until my MR is approved :)
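For readers who want to reuse the idea, the one-liner can be split into two small functions. This is a sketch and not the content of the merge request: the function names `used_percent` and `cleanup_videos` are invented here, and it relies on GNU `df` (for `--output=pcent`) and GNU `find` (for `-delete`).

```shell
#!/bin/sh
# Illustrative sketch only; function names are assumptions, not openQA code.

# Print the used percentage (digits only) of the filesystem containing $1.
used_percent() {
    # df prints a header line plus e.g. " 85%"; drop the header, keep the digits.
    df --output=pcent "$1" | sed '1d;s/[^0-9]//g'
}

# cleanup_videos DIR THRESHOLD_PCT MIN_AGE_DAYS
# Delete *.ogv files older than MIN_AGE_DAYS below DIR, but only if the
# filesystem is at least THRESHOLD_PCT full. The cheap df check runs first,
# so the expensive find is skipped while there is enough free space.
cleanup_videos() {
    dir=$1
    threshold=$2
    min_age_days=$3
    if [ "$(used_percent "$dir")" -ge "$threshold" ]; then
        find "$dir" -type f -iname '*.ogv' -mtime +"$min_age_days" -delete
    fi
}
```

The command above then corresponds to `cleanup_videos /results/testresults 84 28`. Because the `df` check is cheap, such a call can be scheduled frequently without much cost while usage stays below the threshold.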


Updated by okurz over 4 years ago

Copied to coordination #76984: [epic] Automatically remove assets+results based on available free space added


Updated by okurz over 4 years ago

Status changed from Feedback to Resolved

The manually triggered run of the results video cleanup took 43m13.997s on osd. The MR was merged and the change was deployed to osd. We are now down to 79% usage on /results with 1.1T of free space.
