action #76822
closed
Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
Added by okurz about 4 years ago.
Updated about 4 years ago.
Description
Observation
See https://w3.nue.suse.com/~okurz/job_group_results_2020-10-30.png: there seems to be a very sudden increase in the job group "Maintenance: Test Repo/Maintenance: SLE 15 SP2 Updates". I wonder if someone changed result settings or if many recent results just accumulated now. I will just monitor :)
EDIT: On 2020-11-04 we received an email alert from Grafana for /results
Acceptance criteria
- AC1: /results usage is well below the alarm threshold again, leaving headroom for at least some weeks
Suggestions
- Review the trend of individual job groups (see the sketch below for a starting point)
- Reduce result retention periods after coordinating with job group stakeholders or owners
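For the review, a rough starting point from a shell on osd could look like the following sketch. It assumes the openQA layout where /results/testresults holds the per-job result directories (keyed by job ID, not job group) and that the public GET route /api/v1/job_groups returns the per-group retention settings; the response shape and the field names used with jq (keep_results_in_days, keep_logs_in_days) are assumptions to be checked against the actual API output:

    # Largest consumers under the results directory; entries are keyed by job ID,
    # not by job group, so this only shows where the bulk of the data sits.
    du -sh /results/testresults/* 2>/dev/null | sort -rh | head -n 20

    # Retention settings per job group via the openQA API
    # (response shape and field names are assumptions, verify against the actual JSON)
    curl -s https://openqa.suse.de/api/v1/job_groups | \
      jq -r '.[] | [.id, .name, .keep_results_in_days, .keep_logs_in_days] | @tsv'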
Related issues: 1 (1 open, 0 closed)
- Tags set to storage, results, osd, job group settings
- Subject changed from sudden increase in job group results for SLE 15 SP2 Incidents to Fix /results over-usage on osd (was: sudden increase in job group results for SLE 15 SP2 Incidents)
- Description updated (diff)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Priority changed from Normal to Urgent
- Status changed from Workable to In Progress
- Assignee set to okurz
- Status changed from In Progress to Feedback
I explicitly called

    [ "$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')" -ge 84 ] && time find /results/testresults -type f -iname '*.ogv' -mtime +28 -delete

now and we are down to 81%. That should be enough headroom until my MR is approved :)
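For reference, the same logic as the one-liner above could be captured in a small script, e.g. for a recurring cleanup job. This is only a sketch mirroring the values from the manual call (84% usage threshold, videos older than 28 days); the actual change in the MR may look different:

    #!/bin/bash -e
    # Remove old test result videos when the /results filesystem gets too full.
    # Threshold and retention mirror the manual one-liner above.
    threshold=84   # percent usage at which the cleanup kicks in
    days=28        # delete .ogv files older than this many days
    usage=$(df --output=pcent /results/testresults | sed '1d;s/[^0-9]//g')
    if [ "$usage" -ge "$threshold" ]; then
        find /results/testresults -type f -iname '*.ogv' -mtime +"$days" -delete
    fi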
- Copied to coordination #76984: [epic] Automatically remove assets+results based on available free space added
- Status changed from Feedback to Resolved
The manually triggered run of the results video cleanup took 43m13.997s on osd. The MR was merged and the change was deployed to osd. We are down to 79% usage on /results with 1.1T of free space.
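To verify the current state directly on the filesystem, a plain df call is enough:

    # Human-readable usage and free space of the results filesystem
    df -h /results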