action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #167722

open

coordination #161414: [epic] Improved salt based infrastructure management

Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M

Added by okurz 7 months ago. Updated 4 months ago.

Status: Workable
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2024-10-02
Due date:
% Done:
Estimated time:
Tags: space, monitor, influxdb, grafana, infra, vm, qamaster

Description

Observation¶

In #167719 we ran out of space because influxdb grew to 300G+ on monitor.qe.nue2.suse.org. We should look into which measurements consume the most space and ensure that we use space efficiently.

Acceptance criteria¶

AC1: influxdb on monitor.qa.suse.de uses significantly less than 300G
AC2: We know the biggest space usage contributors in influxdb
AC3: We still have a reasonable history of important data, e.g. executed openQA jobs on OSD going back multiple months if not years

Suggestions¶

Research how to find out space usage in influxdb
Be aware of the concepts of downsampling, retention periods, etc., which we already have
- … or not #103380
Look into https://community.home-assistant.io/t/influxdb-setup-to-compress-data-older-than-6-months-2-years/412379
Find candidates where data can be reduced, removed, optimised, compressed
Find out if there are maybe measurements that are not even used anywhere
Gather best practices for the future
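
As a first step for the space-usage research, a minimal sketch that ranks on-disk usage per database and retention policy. It assumes the InfluxDB 1.x directory layout `<data_dir>/<database>/<retention_policy>/<shard_id>/` and the usual package default path `/var/lib/influxdb/data`; both may differ on the actual host. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: summarise on-disk usage of an InfluxDB 1.x data directory.
# Assumption: layout is <data_dir>/<database>/<retention_policy>/<shard_id>/.
# Assumption: /var/lib/influxdb/data is only the common default path.

influx_usage_by_rp() {
    data_dir="${1:-/var/lib/influxdb/data}"
    # du -sk prints "<size in KiB><TAB><path>" for each database/retention
    # policy directory; sort numerically so the largest comes first.
    du -sk "$data_dir"/*/* 2>/dev/null | sort -rn
}
```

Usage would be something like `influx_usage_by_rp /var/lib/influxdb/data | head`. This only resolves usage down to retention policies, not individual measurements; for a per-measurement view the `influx_inspect` tool shipped with 1.x may help, depending on the installed version.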

Rollback steps¶

Remove the influxdb backup from the backup-vm again (currently in /home/nsinger)
Enable "backup-vm: partitions usage (%) alert" on https://stats.openqa-monitor.qa.suse.de/alerting/silences again

Out of scope¶

Increase disk space - this was already done in #167719

Related issues 4 (1 open — 3 closed)

Related to openQA Infrastructure (public) - action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB (Blocked, okurz, 2021-12-01)

Has duplicate openQA Infrastructure (public) - action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) (Resolved, nicksinger, 2024-11-12, 2024-11-27)

Copied from openQA Infrastructure (public) - action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with ""error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" (Resolved, okurz, 2024-10-02)

Copied to openQA Infrastructure (public) - action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S (Resolved, gpathak, 2024-10-02)

Updated by okurz 7 months ago

Copied from action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with ""error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" added

Updated by okurz 7 months ago

Parent task set to #161414

Updated by okurz 7 months ago

Copied to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added

Updated by okurz 7 months ago

Subject changed from Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org to Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz 7 months ago

Related to action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB added

Updated by nicksinger 7 months ago

Status changed from Workable to In Progress
Assignee set to nicksinger

Updated by nicksinger 7 months ago

I read a bit about the future of InfluxDB to better understand what we need and where we want to invest time. We still run v1, the current release is v2, and v3 is on the horizon. The migration from v1 to v2 seems to be a bigger task because major concepts changed, but we might get away easily because we never made heavy use of the database features themselves. v2 to v3 will apparently happen in the same repository (point 1 in https://www.influxdata.com/blog/the-plan-for-influxdb-3-0-open-source/) and should be an easier migration.

To implement any retention policy or downsampling in our current setup we have to use "continuous queries" (CQs) and "retention policies" (RPs) (https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/). At least the CQs will have to be migrated once we switch to v2: https://docs.influxdata.com/influxdb/v2/install/upgrade/v1-to-v2/migrate-cqs/

RPs will be linked to "buckets" in future versions, so they can maybe be migrated automatically.
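
To make the CQ/RP approach concrete, here is a hedged sketch that only prints InfluxQL (1.x) statements for one retention policy and one continuous query that downsamples into it, following the pattern of the linked v1 guide. The function name, the policy name `one_year`, the 30 minute interval and every name in the usage example (database `telegraf`, measurement `cpu`, field `usage_idle`) are hypothetical placeholders and not taken from the actual setup.

```shell
#!/bin/sh
# Sketch: print InfluxQL statements for a retention policy and a continuous
# query that writes 30-minute means into that policy.
# Assumption: policy name, interval and target measurement name are
# illustrative choices, not values from the real configuration.

print_downsampling_statements() {
    db="$1"; measurement="$2"; field="$3"
    cat <<EOF
CREATE RETENTION POLICY "one_year" ON "$db" DURATION 52w REPLICATION 1
CREATE CONTINUOUS QUERY "cq_${measurement}_30m" ON "$db" BEGIN SELECT mean("$field") AS "mean_$field" INTO "one_year"."downsampled_$measurement" FROM "$measurement" GROUP BY time(30m), * END
EOF
}
```

For example, `print_downsampling_statements telegraf cpu usage_idle` prints two statements that could be reviewed and then run one at a time with `influx -execute '<statement>'`. Note that a CQ only processes data arriving after it was created, and the raw data keeps its size until the duration of its own retention policy is shortened.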

Given that, I will first focus on finding, reducing and cleaning up "unused" metrics. We can eventually plan an (easy) migration to v2 before implementing CQs and RPs.
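
One possible way to find candidates for "unused" metrics, sketched under assumptions: measurement names are available in a text file with one name per line (for example saved from the output of `SHOW MEASUREMENTS`), and the Grafana dashboards exist as files in a directory. Both inputs and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch: list measurement names that no dashboard file mentions.
# Assumption: $1 is a file with one measurement name per line.
# Assumption: $2 is a directory containing dashboard definitions as files.

unreferenced_measurements() {
    names_file="$1"; dashboards_dir="$2"
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # -r recurse, -q quiet, -w whole word, -F fixed string (no regex)
        grep -rqwF -- "$name" "$dashboards_dir" || printf '%s\n' "$name"
    done < "$names_file"
}
```

The output is only a candidate list: a name can appear in a dashboard without being queried, and consumers other than dashboards (alerts, ad-hoc queries) are not covered by this check.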

Updated by openqa_review 7 months ago

Due date set to 2024-11-07

Setting due date based on mean cycle time of SUSE QE Tools

Updated by livdywan 7 months ago

Status changed from In Progress to Workable

Setting back to Workable based on availability

#10

Updated by mkittler 6 months ago · Edited

I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.

When I previously suggested migrating to v2, this was opposed with the argument that Leap does not provide a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed, so that shouldn't block us here.)

#11

Updated by nicksinger 6 months ago

Due date changed from 2024-11-07 to 2024-11-15

mkittler wrote in #note-10:

I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.

When I previously suggested migrating to v2, this was opposed with the argument that Leap does not provide a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed, so that shouldn't block us here.)

Yes, makes sense. @okurz also mentioned that an upgrade might make sense. I checked and saw that v2 is at least available in the monitoring repository which we already use for grafana, so I am now looking into the upgrade. But first I need to create a backup of the existing data, which I need to do on monitor, and apparently I need to install influxdb for that. I am going to install it and start a backup now. I am also bumping the due date because I don't expect to finish this ticket today.

#12

Updated by nicksinger 6 months ago

I had to install the influxdb package on backup because it is required to issue the backup command. I connected to backup-vm with SSH agent forwarding and then set up a port forward to the monitoring host with:

ssh root@monitor.qa.suse.de -L 8088:localhost:8088

Afterwards I can issue a backup from "backup" itself with:

nsinger@backup-vm:~/monitor.qa.suse.de/influxdb> influxd backup -portable -host 127.0.0.1:8088 .

This is currently running.
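
Since a backup of a database this size needs a lot of room on the destination, a small pre-check can be useful. The following is a sketch only: the function name is made up, and the required size is an estimate the operator has to supply (for example the current size of the InfluxDB data directory); the actual size of a portable backup may differ.

```shell
#!/bin/sh
# Sketch: succeed only if the destination has at least the requested free space.
# Assumption: $2 is an operator-supplied estimate in KiB, not a measured value.

has_free_space() {
    dest_dir="$1"; needed_kib="$2"
    # df -Pk: POSIX output format in 1024-byte units; column 4 is "Available".
    avail_kib=$(df -Pk "$dest_dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$needed_kib" ]
}
```

With the commands above this could be combined as `has_free_space . 314572800 && influxd backup -portable -host 127.0.0.1:8088 .`, where 314572800 KiB corresponds to 300 GiB.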

#13

Updated by ybonatakis 6 months ago

Has duplicate action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) added

#14

Updated by okurz 6 months ago

Due date changed from 2024-11-15 to 2024-11-29

as discussed in weekly coordination

#15

Updated by nicksinger 6 months ago · Edited

Description updated (diff)

The backup VM filled up and caused an alert. As mentioned in #169750#note-6, this ticket will also take care of cleaning up the old backup.

#16

Updated by okurz 5 months ago

Due date deleted (~~2024-11-29~~)

#17

Updated by nicksinger 5 months ago

Assignee deleted (~~nicksinger~~)

I tested the migration locally on my computer and it worked pretty much flawlessly. However, I was unable to spot any old metric in the new interface, so I have to cross-check with an actual Grafana instance. I have not managed to do this yet and will not manage to do it this year.

#18

Updated by okurz 4 months ago

Priority changed from Normal to High

#19

Updated by okurz 4 months ago

Priority changed from High to Normal
Target version changed from Ready to future

Based on a discussion with nicksinger: if graphs become too slow, we just have to live with it. If storage runs out again, we just have to increase storage. Hence removing the ticket from the backlog and reducing the priority.
