action #167722
open
coordination #161414: [epic] Improved salt based infrastructure management
Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M
Added by okurz 7 months ago.
Updated 4 months ago.
Category:
Feature requests
Description
Observation
In #167719 we ran out of space because influxdb grew to 300G+ on monitor.qe.nue2.suse.org. We should look into which measurements consume the most space and ensure that we store data efficiently.
Acceptance criteria
- AC1: influxdb on monitor.qa.suse.de uses significantly less than 300G
- AC2: We know the biggest space usage contributors in influxdb
- AC3: We still have a reasonable history of important data, e.g. executed openQA jobs on OSD going back multiple months if not years
Suggestions
Rollback steps
Out of scope
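One possible first step towards AC2 is to look at on-disk usage per database directory. The following is only a minimal sketch: the helper name `biggest_dirs` is made up for illustration, and the path in the usage comment assumes the default data directory of a packaged InfluxDB v1 install, where data is laid out per database and retention policy. It shows space per directory, not per measurement, so it can only narrow the search down.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest direct subdirectories of a
# directory, biggest first. Sizes are printed in KiB as reported by du.
biggest_dirs() {
    dir=$1
    count=${2:-10}
    # -s: one total per argument, -k: sizes in KiB
    du -sk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Usage, assuming the default InfluxDB v1 data directory:
# biggest_dirs /var/lib/influxdb/data 5
```

Running the same helper one level deeper (on a single database directory) would break the usage down by retention policy.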
- Increase disk space - this was already done in #167719
- Copied from action #167719: No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device" added
- Parent task set to #161414
- Copied to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added
- Subject changed from Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org to Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M
- Description updated (diff)
- Status changed from New to Workable
- Related to action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB added
- Status changed from Workable to In Progress
- Assignee set to nicksinger
I was reading a bit about the future of InfluxDB to better understand what we need and where we want to invest time. We still run v1, the current release is v2 and v3 is on the horizon. The migration from v1 to v2 seems to be a bigger task because major concepts changed, but we might get away easily because we never made heavy use of the database features themselves. The move from v2 to v3 will apparently happen in the same repository (point 1 in https://www.influxdata.com/blog/the-plan-for-influxdb-3-0-open-source/) and should be an easier migration.
To implement any retention policy or downsampling in our current setup we have to use "continuous queries" (CQs) and "retention policies" (RPs) (https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/). At least the CQs will have to be migrated once we switch to v2: https://docs.influxdata.com/influxdb/v2/install/upgrade/v1-to-v2/migrate-cqs/
RPs will be linked to "buckets" in future versions, so they can maybe be migrated automatically.
Given that, I will first focus on finding, reducing and cleaning up "unused" metrics, and we can eventually plan an (easy) migration to v2 before implementing CQs and RPs.
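For reference, the CQ and RP approach from the v1 guide could look roughly like the sketch below. It only prints InfluxQL statements and does not touch any database. The function name `downsample_statements`, the database name `telegraf`, the policy name `two_years` and both durations are assumptions chosen for illustration, not our actual setup.

```shell
#!/bin/sh
# Print InfluxQL (InfluxDB v1) statements for one retention policy plus one
# continuous query that downsamples every measurement into that policy.
# All names and durations passed in are illustrative assumptions.
downsample_statements() {
    db=$1        # database name
    rp=$2        # name of the long-term retention policy
    keep=$3      # how long downsampled data is kept, e.g. 104w
    interval=$4  # downsampling interval, e.g. 1h
    cat <<EOF
CREATE RETENTION POLICY "$rp" ON "$db" DURATION $keep REPLICATION 1
CREATE CONTINUOUS QUERY "cq_$interval" ON "$db" BEGIN SELECT mean(*) INTO "$db"."$rp".:MEASUREMENT FROM /.*/ GROUP BY time($interval), * END
EOF
}

downsample_statements telegraf two_years 104w 1h
```

The printed statements could be reviewed and then entered in the `influx` shell. Note that this alone would not free any space: the raw data only expires once the retention policy it is written to gets a shorter duration as well.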
- Due date set to 2024-11-07
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Workable
Setting back to Workable based on availability
I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.
When I previously suggested to migrate to v2 this was opposed with the argument that Leap is not providing a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed and that shouldn't block us here.)
- Due date changed from 2024-11-07 to 2024-11-15
mkittler wrote in #note-10:
I suppose it would at some point make sense to switch to v2 and not make our setup more complicated before doing that.
When I previously suggested to migrate to v2 this was opposed with the argument that Leap is not providing a package. Maybe the situation has changed now, though. (And I think we are likely able to build a package of v2 ourselves if needed and that shouldn't block us here.)
Yes, makes sense. @okurz also mentioned that an upgrade might make sense. I checked and saw that v2 is at least available in the monitoring repository which we already use for grafana, so I am now looking into the upgrade. But first I need to create a backup of the existing data, which I need to do on monitor, and apparently I need to install influxdb for that. Going to install it and start a backup now. Also bumping the due date because I don't expect to finish this ticket today.
I had to install the influxdb package on backup because it is required to issue the backup command. I connected to backup-vm with ssh agent forwarding and then did a port-forward onto the monitoring host with:
ssh root@monitor.qa.suse.de -L 8088:localhost:8088
afterwards I can issue a backup from "backup" itself with:
nsinger@backup-vm:~/monitor.qa.suse.de/influxdb> influxd backup -portable -host 127.0.0.1:8088 .
which is currently running
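Since the database had grown to 300G+, a free-space check on the backup target before starting the dump might be worthwhile. The following is a minimal sketch under stated assumptions: the helper name `has_free_space` is hypothetical, and the threshold in the usage comment is only a rough guess, because the size of a portable backup does not have to match the size of the live database.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the filesystem holding DIR has at
# least NEEDED_KB KiB available.
has_free_space() {
    dir=$1
    needed_kb=$2
    # df -P prints stable POSIX columns; column 4 is the available space in KiB
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Usage, assuming roughly 300 GiB are needed in the current directory:
# has_free_space . $((300 * 1024 * 1024)) && influxd backup -portable -host 127.0.0.1:8088 .
```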
- Has duplicate action #169750: [alert] backup-vm (backup-vm: partitions usage (%) alert Generic partitions_usage_alert_backup-vm generic) added
- Due date changed from 2024-11-15 to 2024-11-29
as discussed in weekly coordination
- Description updated (diff)
The backup VM filled up and caused an alert. As mentioned in #169750#note-6, this ticket will also take care of cleaning up the old backup.
- Due date deleted (2024-11-29)
- Assignee deleted (nicksinger)
I tested the migration locally on my computer and it worked pretty much flawlessly. However, I was unable to spot any old metrics in the new interface, so I have to cross-test with an actual Grafana instance. I didn't manage to do this yet and will not manage to do it this year.
- Priority changed from Normal to High
- Priority changed from High to Normal
- Target version changed from Ready to future
Based on a discussion with nicksinger: if graphs become too slow we just have to live with it, and if storage depletes again we just have to increase it. So I am removing the ticket from the backlog and reducing the priority.