action #100985: Come up with a way to regularly check job group configs for outliers and misconfiguration, e.g. overly long result retention periods - openQA Infrastructure - openSUSE Project Management Tool

## Motivation 
 As investigated in #100859, the database on OSD grew considerably. Over 80% of the space on /srv is used up, and most of it is used by our PostgreSQL database. 
 I raised this concern in [slack](https://suse.slack.com/archives/C02AJ1E568M/p1634030652186600), where some possible reasons were discussed. One open question is why PostgreSQL uses so much space. @mkittler mentioned that there are many jobs which are likely uninteresting and which are kept around for a long time due to overly long result retention periods. A fresh database import also lowers disk space consumption drastically. 

 Also see https://suse.slack.com/archives/C02AJ1E568M/p1634060017258600?thread_ts=1634030652.186600&cid=C02AJ1E568M for context and [poo#89821](https://progress.opensuse.org/issues/89821) for some history about the alert itself. 

 ## Acceptance Criteria 
 **AC1**: We regularly check job group configs for outliers and misconfiguration, e.g. with an alert for overly long result retention periods. The alert no longer triggers. 
 **AC2**: Understand why our production database uses the space it uses 

 ## Suggestions 
 * Query the job group configuration using SQL with some limits and alert if the query returns anything 
 * Enlarge the partition by opening an eng-infra ticket and asking for more space for /dev/vdb 
 * Figure out if the disk utilization of our database can be optimized 
 * DONE by coolo: Try if the disk utilization can be reduced, e.g. by running the PostgreSQL vacuum 
 * ~~See if autovacuum can be configured or if thresholds can be lowered (https://suse.slack.com/archives/C02AJ1E568M/p1634033225193500?thread_ts=1634030652.186600&cid=C02AJ1E568M)~~ -> #100979 
 * ~~better alerts~~ -> #100976
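
The suggestion to query the job group configuration with some limits and alert on any result could be sketched roughly as follows. This is only an illustration under assumptions: the table name `job_groups` and the column `keep_results_in_days` are assumed rather than taken from this ticket, the 365-day limit is a made-up placeholder, and an in-memory SQLite database stands in for the production PostgreSQL instance so the example runs on its own.

```python
import sqlite3

# Hypothetical limit: the ticket does not say what counts as "overly long".
MAX_RESULT_RETENTION_DAYS = 365

# Assumed schema: a job_groups table with a keep_results_in_days column.
OUTLIER_QUERY = """
    SELECT id, name, keep_results_in_days
    FROM job_groups
    WHERE keep_results_in_days > ?
    ORDER BY keep_results_in_days DESC
"""


def find_retention_outliers(conn, max_days=MAX_RESULT_RETENTION_DAYS):
    """Return (id, name, days) rows for job groups above the limit."""
    return conn.execute(OUTLIER_QUERY, (max_days,)).fetchall()


def format_alert(outliers):
    """Build one alert line per outlier; an empty list means no alert."""
    return [
        f"job group {gid} ({name}): results kept for {days} days"
        for gid, name, days in outliers
    ]


def build_demo_db():
    """Create an in-memory stand-in database with made-up job groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_groups ("
        "id INTEGER PRIMARY KEY, name TEXT, keep_results_in_days INTEGER)"
    )
    conn.executemany(
        "INSERT INTO job_groups VALUES (?, ?, ?)",
        [(1, "Group A", 30), (2, "Group B", 2000), (3, "Group C", 365)],
    )
    return conn
```

A periodic job could call `find_retention_outliers` and raise an alert whenever `format_alert` returns a non-empty list. Against PostgreSQL the same query would need a PostgreSQL driver and that driver's placeholder syntax instead of `?`.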
