action #94492

Configure retention/downsampling policy for monitoring data stored within InfluxDB size:M

Added by mkittler 6 months ago. Updated 5 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2021-06-22
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Currently the default retention policy is in place:

> use telegraf
Using database telegraf
> show retention policies
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

According to the documentation that means we store the data indefinitely¹.

It would be possible to configure a time-based retention policy per database (we store all our monitoring data in one database called telegraf): https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/#create-retention-policies-with-create-retention-policy
This raises the question of which durations we'd like to configure. It would also make sense to "downsample and retain data", i.e. to store data with less granularity over time.

I haven't found anything about disk-space-based retention policies within the documentation of InfluxDB so I assume this is not an option.
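
For illustration, a time-based policy on the telegraf database could look like this (a sketch following the linked documentation; the policy name "one_year" and the duration are hypothetical, not a decision):

CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 365d REPLICATION 1 DEFAULT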


¹

When you create a database, InfluxDB creates a retention policy called autogen with an infinite duration…

Acceptance Criteria

  • AC1: The directory monitor.qa.suse.de:/var/lib/influxdb occupies significantly less than the current 120 GB
  • AC2: Grafana is able to show at least 6 months worth of (optionally downsampled) data

Suggestions

  • Ensure we have a backup before the process removes data
  • Check what the current InfluxDB version 1.8 provides
  • If we have downsampling, we should aim for at least 1 month of full-sampled data
  • Check out previous attempts to export data
    • a dump is still lying around at /var/lib/influxdb/dumps/postgresql; it can be deleted if the disk space is needed
    • see #93922#note-10 but do not make the same mistake of using grep again for filtering

Related issues

Related to openQA Infrastructure - action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB (New, 2021-12-01)

Copied to openQA Infrastructure - action #97583: [spike] Configure retention/downsampling policy for monitoring data stored within InfluxDB on newer influxdb version (Rejected, 2021-06-22)

History

#1 Updated by okurz 6 months ago

Isn't this time series database meant to store data with less granularity over time?

#2 Updated by okurz 6 months ago

  • Target version set to Ready

#3 Updated by mkittler 6 months ago

  • Description updated (diff)

Yes, that would make sense. They're calling it downsampling. I've added the relevant link to the ticket description.

#4 Updated by ilausuch 5 months ago

  • Subject changed from Configure retention policy for monitoring data stored within InfluxDB to Configure retention policy for monitoring data stored within InfluxDB size: M
  • Description updated (diff)

#5 Updated by ilausuch 5 months ago

  • Subject changed from Configure retention policy for monitoring data stored within InfluxDB size: M to Configure retention policy for monitoring data stored within InfluxDB size:M

#6 Updated by okurz 5 months ago

  • Subject changed from Configure retention policy for monitoring data stored within InfluxDB size:M to Configure retention/downsampling policy for monitoring data stored within InfluxDB size:M
  • Description updated (diff)
  • Status changed from New to Workable

Fixes and additions as discussed verbally with the team in the joint estimation session

#7 Updated by mkittler 5 months ago

  • Description updated (diff)
  • Target version deleted (Ready)

Mention previous attempts to export data

#8 Updated by okurz 5 months ago

  • Target version set to Ready

#9 Updated by okurz 3 months ago

  • Copied to action #97583: [spike] Configure retention/downsampling policy for monitoring data stored within InfluxDB on newer influxdb version added

#10 Updated by okurz 3 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

blocked on #97583

#11 Updated by okurz 3 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

I rejected #97583: I checked OBS and we do not have a version more recent than our production InfluxDB 1.8 anywhere. So this ticket is really about what is currently provided in openSUSE packages, i.e. currently 1.8.6.

#12 Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback
  • Assignee set to mkittler

Judging by the documentation it looks like this will be rather coarse, static, and require a lot of manual effort.


Here are some details. First, we'd need to change the default retention policy; e.g. the following command would cause new raw data to be deleted after half a year:

CREATE RETENTION POLICY "half_year" ON "telegraf" DURATION 180d REPLICATION 1 DEFAULT

Then we'd create a 2nd retention policy for "downsampled" data, e.g. to store it for approximately two years:

CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 720d REPLICATION 1

Retention policies seem to be per database and we store everything in telegraf, so that's as fine-grained as it gets.

Then we'd need to create a continuous query (a query which is performed periodically) to downsample the raw data (stored under the default retention policy) and store the downsampled data under the second retention policy:

CREATE CONTINUOUS QUERY "cq_downsample_some_fields" ON "telegraf" BEGIN
  SELECT mean("ratio_failed_p") AS "mean_ratio_failed_p", mean("incompletes_not_restarted_last_24h") AS "mean_incompletes_not_restarted_last_24h"
  INTO "two_years"."downsampled_postgresql"
  FROM "postgresql"
  GROUP BY time(1h)
END

This query would run every hour and would downsample the values with the field keys ratio_failed_p and incompletes_not_restarted_last_24h within the postgresql measurement by computing the average within one hour. The downsampled values would have different field keys. I was following the documentation here but maybe it makes more sense for us to use the same field keys (if that is possible) so the downsampled values would show up in the same way as the raw values.

This CQ is just an example covering two fields from one measurement. I suppose it would need to be extended to all the fields we want to cover. However, that does not sound very practical. I'm wondering whether there's really no better way than dealing with all fields individually, and what to do with field values where one cannot easily compute an average. Additionally, the CQ only considers data from the last hour, which makes sense for downsampling new data as it comes in. The documentation does not mention how to deal with existing data, though, so I'm not sure how it would be handled. If we do nothing further, I suppose existing data will either just be discarded with no downsampled version being stored longer (1), or the new default retention policy won't be applied to old data and it'll just be kept forever (2). Maybe it is possible to downsample existing data with a manual query (and drop the raw data manually).
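
Downsampling the existing raw data could presumably be done with a one-off variant of the same query (a sketch, untested; it assumes SELECT ... INTO accepts the same field list as the continuous query above):

SELECT mean("ratio_failed_p") AS "mean_ratio_failed_p", mean("incompletes_not_restarted_last_24h") AS "mean_incompletes_not_restarted_last_24h"
INTO "two_years"."downsampled_postgresql"
FROM "postgresql"
GROUP BY time(1h)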


We could of course also only create a retention policy (simply discarding data exceeding a certain age, without the downsampling step).

#13 Updated by mkittler 28 days ago

By the way, creating a new default retention policy as suggested by the documentation would (to my understanding) not affect any existing data, which is still kept under the current default retention policy "autogen"¹. We could of course just use e.g. DELETE WHERE time < '2020-11-08' to delete everything older than a year (from now), but then we'd still need to repeat that process until all data from the "autogen" retention policy has been deleted.

Alternatively we could likely also alter the current "autogen" retention policy like this (as documented):

ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 720d

I suppose then we wouldn't need to worry about removing existing data.


¹

> SHOW RETENTION POLICIES ON "telegraf"
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

#14 Updated by okurz 20 days ago

Ok, so I understand it's not as simple as defining a default downsampling policy that affects all data. That's how I originally thought influxdb would work. Reading your comments and pages like https://www.influxdata.com/blog/downsampling-influxdb-v2-0/ and https://www.influxdata.com/influxdb-templates/downsampling-influxdb/ and https://hackernoon.com/influxdb-continuous-downsampling-optimize-your-tsdb-today-q3lm2fhg and https://influxdb.narkive.com/AQM2b071/automatically-downsample-all-data#post7 I think it might be better if we identify which measurements have the biggest impact and provide explicit, specific retention+downsampling per measurement. Can you identify, according to the Pareto principle, which measurements are the biggest offenders?

#15 Updated by mkittler 17 days ago

It is at least possible to determine the disk space used per measurement, e.g.:

martchus@openqa-monitor:~> sudo influx_inspect report-disk -detailed /var/lib/influxdb/data > influx-disk-usage.json
martchus@openqa-monitor:~> jq -r '["measurement","size_gb"], (.Measurement | sort_by(.size) | reverse[] | [select(.db == "telegraf")] | map(.measurement,.size / 1024 / 1024 / 1024)) | @tsv' influx-disk-usage.json
measurement     size_gb
apache_log      59.99271512031555
cpu     46.17899715155363
procstat        20.055298353545368
disk    7.618690362200141
ping    5.069371770136058
diskio  3.353263510391116
mem     2.891846708022058
net     2.7090410608798265
system  0.9844627007842064
apache  0.5020766351372004
ntpq    0.48354801069945097
kernel  0.4041093336418271
procstat_lookup 0.3167126877233386
chrony  0.30627269111573696
processes       0.209840452298522
postgresql      0.10544836893677711
http_response   0.06108146067708731
swap    0.059172035194933414
openqa_jobs_by_group    0.04418153967708349
openqa_jobs_by_worker   0.039369286969304085
openqa_jobs_by_arch     0.028374306857585907
openqa_minion_jobs      0.024388955906033516
openqa_minion_workers   0.015562820248305798
openqa_jobs     0.014018760062754154
systemd_failed  0.005923120304942131
lab_temperature 0.00429917685687542
httpjson_worker_minion_stats    0.0014895927160978317
my_log  0.00015092454850673676
systemd_failed_test     8.108559995889664e-05
telegraf        2.9811635613441467e-06
openqa-websockets       2.8098002076148987e-06
openqa-scheduler        2.5257468223571777e-06
cron    2.4391338229179382e-06
smb     2.387911081314087e-06
rsyncd  2.343207597732544e-06
openqa-livehandler      2.3096799850463867e-06
salt-minion     2.255663275718689e-06
salt-master     2.255663275718689e-06
vsftpd  2.2351741790771484e-06
openqa-webui    1.8579885363578796e-06
openqa-gru      1.8207356333732605e-06
apache2 1.6884878277778625e-06
postfix 1.6335397958755493e-06
sshd    1.5506520867347717e-06

So I could try it for the fields of the top measurements. Note that some of the field types are string, so computing an average like in the previously mentioned example isn't possible. It also wouldn't make much sense for some fields with number types, e.g. http_version¹. I suppose I'll just use last in such cases.


¹

> show field keys from apache_log
name: apache_log
fieldKey         fieldType
--------         ---------
agent            string
auth             string
client_ip        string
http_version     float
ident            string
referrer         string
request          string
resp_bytes       integer
response_time_us integer

#16 Updated by mkittler 14 days ago

The default retention policy can only be set at the database level; at least I couldn't find a way to set the default retention policy per measurement. So I cannot follow the documentation examples when covering only a certain measurement like apache_log, as autogen should still remain the default retention policy in general. That means I need to find a way to move (and not just copy) raw data from one retention policy to another via a continuous query. It looks like this isn't possible, though: all commands/examples I've found only copy the data (and rely on the retention period for the cleanup).


The actual downsampling would look like this, by the way:

CREATE CONTINUOUS QUERY "cq_downsample_apache_log" ON "telegraf" BEGIN
    SELECT
        last("agent") AS "agent",
        last("auth") AS "auth",
        last("client_ip") AS "client_ip",
        last("http_version") AS "http_version",
        last("ident") AS "ident",
        last("referrer") AS "referrer",
        last("request") AS "request",
        mean("resp_bytes") AS "resp_bytes",
        mean("response_time_us") AS "response_time_us"
    INTO "two_years"."downsampled_apache_log"
    FROM "half_year"."apache_log"
    GROUP BY time(1h)
END

I've tested the SELECT (without INTO and some time constraint like WHERE time >= '2021-11-22T00:00:00Z') and it worked. The SELECT (with INTO) could also be run manually to cover existing data although one might need to split the query into different time boundaries (see https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/#rename-a-database).
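
Such a manual backfill split into time boundaries might look like this (a sketch; the one-month window is hypothetical and the field list is abbreviated to two fields from the CQ above):

SELECT last("agent") AS "agent", mean("resp_bytes") AS "resp_bytes"
INTO "two_years"."downsampled_apache_log"
FROM "half_year"."apache_log"
WHERE time >= '2021-10-01T00:00:00Z' AND time < '2021-11-01T00:00:00Z'
GROUP BY time(1h)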

#17 Updated by mkittler 14 days ago

I've now changed the default retention period to delete everything older than 2 years. I also immediately deleted everything older than 2 years (via delete where time < '2019-11-22'). /var/lib/influxdb shrunk from 185 GiB to 152 GiB.

I've also deleted all data associated with the apache_log measurement older than 1 year (via delete from "apache_log" where time < '2020-11-22') which shrunk it to 150 GiB.

So in the file system the effect isn't very visible. However, report-disk shows the savings more clearly:

measurement     size_gb
apache_log      32.94414313044399
cpu     28.99280675780028
procstat        18.21759712137282
disk    3.7574257869273424
ping    3.290378055535257
net     2.468618060462177
diskio  1.9918824285268784
mem     1.8506945176050067
system  0.5679652551189065
apache  0.45197963155806065
ntpq    0.4201518790796399
chrony  0.3121297275647521
procstat_lookup 0.30364393070340157
kernel  0.2221601502969861
processes       0.13334486354142427
postgresql      0.10715502314269543
swap    0.04125824477523565
http_response   0.03266702499240637
openqa_jobs_by_group    0.02366778813302517
openqa_minion_jobs      0.022559572011232376
openqa_jobs_by_worker   0.020542572252452374
openqa_jobs_by_arch     0.0148719921708107
openqa_minion_workers   0.014549209736287594
openqa_jobs     0.006770698353648186
systemd_failed  0.005970432423055172
lab_temperature 0.0044286223128438
httpjson_worker_minion_stats    0.0014895927160978317
systemd_failed_test     8.223019540309906e-05

Not sure how to tell InfluxDB to free up space in the file system. Maybe the effect would have been more visible with a smaller shard duration.
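
For reference, the shard group duration can be set per retention policy (a sketch; the 24h value is hypothetical, and whether changing it retroactively helps reclaim space is untested):

ALTER RETENTION POLICY "autogen" ON "telegraf" SHARD DURATION 24h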

#18 Updated by mkittler 6 days ago

Note that the size reported by report-disk decreased from 151.48 GiB to 96.1605 GiB.

#19 Updated by okurz 5 days ago

Ok. To fulfill AC1 you could compress /var/lib/influxdb/dumps/postgresql, move it elsewhere, or delete it. I am trying my luck with compression right now: I called nice -n 1 xz -3 --threads=0 postgresql in a root screen session on openqa-monitor.qa . As that file is text with a lot of repetition, I guess that should be quite efficient.

#20 Updated by okurz 5 days ago

Compression finished. The file is down to 381 MiB and usage in /var/lib/influxdb is 101G, so AC1 is fulfilled :)

I recommend you either follow-up with downsampling policies for specific measurements or record the individual ideas for improvements in separate tickets if you want to get rid of the work for now :)

#21 Updated by mkittler 5 days ago

  • Related to action #103380: Configure retention/downsampling policy for specific monitoring data stored within InfluxDB added

#22 Updated by mkittler 5 days ago

  • Status changed from Feedback to Resolved

Sounds good. These files are likely from a previous attempt of mine to fix another problem (some type conversion problem). So I suppose we could also just remove these files (if we need the disk space).

I've created #103380 as a follow-up, but obviously I'm out of ideas so I couldn't record any concrete ones there. Of course I noted that one should look into my previous attempts from this ticket.
