action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #167722

open

coordination #161414: [epic] Improved salt based infrastructure management

Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M

Added by okurz 7 months ago. Updated 4 months ago.

Status: Workable
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2024-10-02
Due date:
% Done:
Estimated time:
Tags: space, monitor, influxdb, grafana, infra, vm, qamaster

Description

Observation¶

In #167719 we ran out of space because influxdb grew to more than 300G on monitor.qe.nue2.suse.org. We should find out which measurements consume the most space and make sure that we store the data space-efficiently.

Acceptance criteria¶

AC1: influxdb on monitor.qa.suse.de uses significantly less than 300G
AC2: We know the biggest space usage contributors in influxdb
AC3: We still have a reasonable history of important data, e.g. executed openQA jobs on OSD going back multiple months if not years
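
As a starting point for AC2, on-disk usage can be summed per database and retention policy. The following is only a minimal sketch, assuming an InfluxDB 1.x style data directory laid out as `<data_dir>/<database>/<retention policy>/<shard>/<files>`. The function names are hypothetical, and the result is per database and retention policy, not per measurement, so it only narrows down where to look further.

```python
import os
from collections import defaultdict


def usage_by_database_and_policy(data_dir):
    """Sum file sizes below data_dir, grouped by (database, retention policy).

    Assumption: InfluxDB 1.x layout <data_dir>/<database>/<policy>/<shard>/<files>.
    """
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        parts = os.path.relpath(root, data_dir).split(os.sep)
        if len(parts) < 2:
            continue  # not yet inside a <database>/<policy> directory
        key = (parts[0], parts[1])
        for name in files:
            totals[key] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def format_report(totals):
    """Return one line per (database, policy), largest consumer first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [f"{size / 2**30:8.2f} GiB  {db}/{rp}" for (db, rp), size in ranked]
```

It could be called as `format_report(usage_by_database_and_policy("/var/lib/influxdb/data"))`, where the path is an assumption based on the `/var/lib/influxdb/` prefix seen in #167719 and needs to be adjusted to the actual data directory.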

Suggestions¶

Research how to find out space usage in influxdb
Be aware of the concepts of downsampling, retention periods, etc. which we already have
- … or not #103380
Look into https://community.home-assistant.io/t/influxdb-setup-to-compress-data-older-than-6-months-2-years/412379
Find candidates where data can be reduced, removed, optimised, compressed
Find out whether there are measurements that are not used anywhere
Gather best practices for the future
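
To make the downsampling and retention suggestions above more concrete, here is a hedged sketch that only builds InfluxQL statement strings for review; it does not talk to any server. It assumes an InfluxDB 1.x instance with InfluxQL. The database and policy names in the usage note are hypothetical placeholders, and `mean(*)` only applies to numeric fields and writes them as `mean_<field>`, so existing queries would need adapting.

```python
def retention_policy_statement(database, policy, duration, default=False):
    """Build a CREATE RETENTION POLICY statement (InfluxQL, InfluxDB 1.x)."""
    stmt = (
        f'CREATE RETENTION POLICY "{policy}" ON "{database}" '
        f"DURATION {duration} REPLICATION 1"
    )
    return stmt + (" DEFAULT" if default else "")


def downsampling_query_statement(database, source_policy, target_policy, interval):
    """Build a continuous query that averages all measurements of
    source_policy into target_policy at the given interval."""
    return (
        f'CREATE CONTINUOUS QUERY "cq_{target_policy}_{interval}" ON "{database}" BEGIN '
        f'SELECT mean(*) INTO "{database}"."{target_policy}".:MEASUREMENT '
        f'FROM "{database}"."{source_policy}"./.*/ '
        f"GROUP BY time({interval}), * END"
    )
```

For example, `retention_policy_statement("exampledb", "long_term", "INF")` followed by `downsampling_query_statement("exampledb", "autogen", "long_term", "1h")` would describe keeping hourly averages indefinitely while the raw policy stays short. The concrete durations and intervals are decisions for this ticket and #103380, not something this sketch prescribes.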

Rollback steps¶

Remove the influxdb backup from the backup-vm again (currently in /home/nsinger)
Enable "backup-vm: partitions usage (%) alert" on https://stats.openqa-monitor.qa.suse.de/alerting/silences again

Out of scope¶

Increase disk space - this was already done in #167719

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure (public) - action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB (Blocked, okurz, 2021-12-01)

Has duplicate openQA Infrastructure (public) - action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) (Resolved, nicksinger, 2024-11-12, 2024-11-27)

Copied from openQA Infrastructure (public) - action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" (Resolved, okurz, 2024-10-02)

Copied to openQA Infrastructure (public) - action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S (Resolved, gpathak, 2024-10-02)


Updated by okurz 7 months ago

Updated by okurz 7 months ago

Updated by okurz 7 months ago

Updated by okurz 7 months ago

Updated by okurz 7 months ago

Updated by nicksinger 7 months ago

Updated by nicksinger 7 months ago

Updated by openqa_review 7 months ago

Updated by livdywan 7 months ago

Updated by mkittler 6 months ago · Edited

Updated by nicksinger 6 months ago

Updated by nicksinger 6 months ago

Updated by ybonatakis 6 months ago

Updated by okurz 6 months ago

Updated by nicksinger 6 months ago · Edited

Updated by okurz 5 months ago

Updated by nicksinger 5 months ago

Updated by okurz 4 months ago

Updated by okurz 4 months ago