action #76822 (closed)

Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)

Added by okurz about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Start date: 2020-10-30
Due date: 2020-11-13
% Done: 0%
Estimated time: -
Description

Observation

See https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png: there seems to be a very sudden increase in the job group "Maintenance: Test Repo/Maintenance: SLE 15 SP2 Updates". I wonder if someone changed result retention settings or if many recent results simply accumulated now. I will just monitor :)

EDIT: On 2020-11-04 we received an email alert from Grafana for /results.

Acceptance criteria

  • AC1: /results usage is well below the alert threshold again, with headroom for at least some weeks

Suggestions

  • Review the trend of individual job groups
  • Reduce result retention periods after coordinating with job group stakeholders or owners (see the sketch below)
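For reference, retention can be shortened per job group via its cleanup properties, either in the web UI or over the API. A minimal sketch, assuming openqa-cli is available; the group id and values are hypothetical, and the property names follow the usual openQA job group cleanup settings and should be verified against the instance before use:

    # sketch only: shorten retention for a hypothetical job group 110
    # (values and property names are illustrative, not a recommendation)
    openqa-cli api --host https://openqa.suse.de -X PUT job_groups/110 \
        keep_logs_in_days=20 \
        keep_important_logs_in_days=60 \
        keep_results_in_days=120 \
        keep_important_results_in_days=365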

Related issues 1 (1 open, 0 closed)

Copied to openQA Project (public) - coordination #76984: [epic] Automatically remove assets+results based on available free space (status: New, 2021-01-21)

Actions #1

Updated by okurz about 4 years ago

  • Tags set to storage, results, osd, job group settings
  • Subject changed from sudden increase in job group results for SLE 15 SP2 Incidents to Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
  • Description updated (diff)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Normal to Urgent

In the meantime an email alert message confirmed the problem:

[osd-admins] [Alerting] File systems alert
From:   Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id:    <osd-admins.suse.de>
Date:   03/11/2020 20.20

[Alerting] File systems alert
results: Used Percentage
90.008

Checking right now, we are back to 85% after a sudden decrease in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1604430390434&to=1604494503308 , likely simply the regular results cleanup.

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&fullscreen&panelId=74&from=1601949566797&to=1604500482068 shows an increase over the last 30 days from 70% up to the alert crossing 90%.

Actions #2

Updated by okurz about 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

A discussion is going on about what could have caused that; maybe it was a personal bug investigation action. I will see if this leads to conclusions and resolutions. Also, I prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/396 so that we remove lengthy videos before we trigger any alerts. As the check first calls "df", the cost of calling it much more often is negligible, so I extended https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/396 to cover that as well.
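The approach is roughly: first check the fill level of /results (cheap, a single df call) and only then delete old video files. A minimal sketch of that logic, based on the command run manually in the next comment; the actual salt-managed script in the MR may differ in paths and thresholds:

    # sketch of the threshold-gated cleanup described above
    usage=$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')
    if [ "$usage" -ge 84 ]; then
        # delete video files older than 28 days before the alert threshold is hit
        find /results/testresults -type f -iname '*.ogv' -mtime +28 -delete
    fi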

Actions #3

Updated by okurz about 4 years ago

  • Status changed from In Progress to Feedback

I explicitly called

    [ "$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')" -ge 84 ] && time find /results/testresults -type f -iname '*.ogv' -mtime +28 -delete

now and we are down to 81%. Enough headroom until my MR is approved for sure :)

Actions #4

Updated by okurz about 4 years ago

  • Copied to coordination #76984: [epic] Automatically remove assets+results based on available free space added
Actions #5

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

The manually triggered run of the results video cleanup took 43m13.997s on osd. The MR was merged and the change was deployed to osd. We are down to 79% usage on /results with 1.1T of free space.
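For the record, the current fill level can be checked directly on osd; a trivial sketch of the commands involved:

    # human-readable overview of /results usage
    df -h /results
    # the bare percentage value the cleanup and alerting logic looks at
    df --output=pcent /results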
