action #167722

open

coordination #161414: [epic] Improved salt based infrastructure management

Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M

Added by okurz 5 months ago. Updated about 2 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2024-10-02
Due date:
% Done:

0%

Estimated time:

Description

Observation

In #167719 we ran out of space because influxdb grew to 300G+ on monitor.qe.nue2.suse.org. We should find out which measurements consume the most space and make sure that we store data space-efficiently.

Acceptance criteria

  • AC1: influxdb on monitor.qa.suse.de uses significantly less than 300G
  • AC2: We know the biggest space usage contributors in influxdb
  • AC3: We still have a reasonable history of important data, e.g. executed openQA jobs on OSD going back multiple months if not years
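
For AC2, a first coarse look could be taken at filesystem level before touching any queries. The following is only a minimal sketch: the function name `rank_by_size` is made up for illustration, and the default path is an assumption that should be checked against the `[data]` section of the actual influxdb.conf. InfluxDB v1 lays out its data directory as `<database>/<retention policy>/<shard>`, so this ranks databases (or, one level deeper, retention policies), not individual measurements.

```shell
#!/bin/sh
# Sketch: rank the immediate subdirectories of a directory by disk usage.
# The default path is an assumption, not taken from the ticket.
rank_by_size() {
    dir="${1:-/var/lib/influxdb/data}"
    # du -sk prints one "<KiB> <path>" line per subdirectory;
    # sort -rn puts the largest entry first.
    du -sk "$dir"/*/ 2>/dev/null | sort -rn
}
```

Calling `rank_by_size` without arguments would rank databases under the assumed default path, and `rank_by_size /var/lib/influxdb/data/<database>` would rank the retention policies of one database. A per-measurement breakdown needs InfluxDB's own tooling and is not covered by this sketch.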

Suggestions

Rollback steps

Out of scope

  • Increase disk space - this was already done in #167719

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure (public) - action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB (Blocked, okurz, 2021-12-01)

Has duplicate openQA Infrastructure (public) - action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) (Resolved, nicksinger, 2024-11-12, 2024-11-27)

Copied from openQA Infrastructure (public) - action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" (Resolved, okurz, 2024-10-02)

Copied to openQA Infrastructure (public) - action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S (Resolved, gpathak, 2024-10-02)

Actions #1

Updated by okurz 5 months ago

  • Copied from action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" added
Actions #2

Updated by okurz 5 months ago

  • Parent task set to #161414
Actions #3

Updated by okurz 5 months ago

  • Copied to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added
Actions #4

Updated by okurz 5 months ago

  • Subject changed from Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org to Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 5 months ago

  • Related to action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB added
Actions #6

Updated by nicksinger 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #7

Updated by nicksinger 5 months ago

I was reading a bit about the future of influxDB to better understand what we need and where we want to invest time. We still run v1, the current version is v2, and v3 is on the horizon. The migration from v1 to v2 seems to be a bigger task because major concepts changed, but we might get away easily because we never made heavy use of the database features themselves. v2 to v3 apparently will happen in the same repository (point 1 in https://www.influxdata.com/blog/the-plan-for-influxdb-3-0-open-source/) and should be an easier migration.

To implement any retention policy or down-sampling in our current setup we have to use "continuous queries" (CQs) and "retention policies" (RPs) (https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/). At least the CQs will have to be migrated once we switch to v2: https://docs.influxdata.com/influxdb/v2/install/upgrade/v1-to-v2/migrate-cqs/

RPs will be linked to "buckets" in future versions, so they can perhaps be migrated automatically.

Given that, I will first focus on finding, reducing and cleaning up "unused" metrics. We can eventually plan an (easy) migration to v2 before implementing CQs and RPs.
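
To make the CQ/RP idea from the v1 guide above more concrete, here is a rough sketch that only prints InfluxQL (v1) statements and does not touch any database. Everything named in it is an assumption for illustration: the helper `downsample_statements`, the policy name `downsampled_2y`, the 104w duration, the 1h interval, and the example database and measurement names are not taken from our setup.

```shell
#!/bin/sh
# Sketch: print InfluxQL v1 statements for a long-lived retention policy and
# a continuous query that writes hourly means into it.
# All names, durations and intervals are placeholders (assumptions).
downsample_statements() {
    db="$1"           # database name, hypothetical example: telegraf
    measurement="$2"  # measurement to downsample, hypothetical example: cpu
    cat <<EOF
CREATE RETENTION POLICY "downsampled_2y" ON "$db" DURATION 104w REPLICATION 1
CREATE CONTINUOUS QUERY "cq_${measurement}_1h" ON "$db" BEGIN SELECT mean(*) INTO "$db"."downsampled_2y"."$measurement" FROM "$measurement" GROUP BY time(1h), * END
EOF
}
```

The output of e.g. `downsample_statements telegraf cpu` could be reviewed and then run statement by statement with `influx -execute '<statement>'`. Note that a CQ only processes newly arriving data, so existing history would not be downsampled by it, and shortening the retention of the raw data would be a separate, destructive step.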

Actions #8

Updated by openqa_review 5 months ago

  • Due date set to 2024-11-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by livdywan 5 months ago

  • Status changed from In Progress to Workable

Setting back to Workable based on availability

Actions #10

Updated by mkittler 4 months ago · Edited

I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.

When I previously suggested to migrate to v2 this was opposed with the argument that Leap is not providing a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed and that shouldn't block us here.)

Actions #11

Updated by nicksinger 4 months ago

  • Due date changed from 2024-11-07 to 2024-11-15

mkittler wrote in #note-10:

I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.

When I previously suggested to migrate to v2 this was opposed with the argument that Leap is not providing a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed and that shouldn't block us here.)

Yes, makes sense. @okurz also mentioned that an upgrade might make sense. I checked and saw that v2 is at least available in the monitoring repository which we already use for grafana, so I am now looking into the upgrade. But first I need to create a backup of the existing data, which I need to do on monitor, and apparently I need to install influxdb for that. Going to install and start a backup now. Also bumping the due date because I don't expect to finish this ticket today.

Actions #12

Updated by nicksinger 4 months ago

I had to install the influxdb package on backup because it is required to issue the backup command. I connected to backup-vm with ssh agent forwarding and then did a port-forward onto the monitoring host with:

ssh root@monitor.qa.suse.de -L 8088:localhost:8088

Afterwards I can issue a backup from "backup" itself with:

nsinger@backup-vm:~/monitor.qa.suse.de/influxdb> influxd backup -portable -host 127.0.0.1:8088 .

which is currently running
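
Because a backup of a database this large needs a lot of room on the target, a free-space check before starting could look roughly like the following. This is only a sketch: the helper name `has_free_kib` and the threshold in the usage note are assumptions, and how much space a portable backup of our data actually needs was not measured here.

```shell
#!/bin/sh
# Sketch: succeed if the filesystem holding directory $1 has at least
# $2 KiB available. The function name is hypothetical.
has_free_kib() {
    # df -P gives the portable one-line-per-filesystem format, -k uses
    # 1024-byte blocks; column 4 of the second line is the available space.
    avail=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail" -ge "$2" ]
}
```

For example, `has_free_kib . 314572800 && influxd backup -portable -host 127.0.0.1:8088 .` would only start the backup if roughly 300 GiB are free in the current directory. That threshold is just the database size from the ticket description used as a pessimistic guess.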

Actions #13

Updated by ybonatakis 4 months ago

  • Has duplicate action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) added
Actions #14

Updated by okurz 4 months ago

  • Due date changed from 2024-11-15 to 2024-11-29

as discussed in weekly coordination

Actions #15

Updated by nicksinger 4 months ago · Edited

  • Description updated (diff)

The backup VM filled up and caused an alert. As mentioned in #169750#note-6, this ticket will also take care of cleaning up the old backup.

Actions #16

Updated by okurz 3 months ago

  • Due date deleted (2024-11-29)
Actions #17

Updated by nicksinger 3 months ago

  • Assignee deleted (nicksinger)

I tested the migration locally on my computer and it worked pretty flawlessly. However, I was unable to spot any old metric in the new interface, so I have to cross-test with an actual Grafana instance. I didn't manage to do this yet and will not manage to do it this year.

Actions #18

Updated by okurz about 2 months ago

  • Priority changed from Normal to High
Actions #19

Updated by okurz about 2 months ago

  • Priority changed from High to Normal
  • Target version changed from Ready to future

Based on a discussion with nicksinger: if graphs become too slow we just have to live with it, and if storage depletes again we just have to increase storage. So I am removing the ticket from the backlog and reducing the priority.
