Configure retention/downsampling policy for monitoring data stored within InfluxDB size:M
Currently the default retention policy is in place:
> use telegraf
Using database telegraf
> show retention policies
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true
According to the documentation that means we store the data infinitely¹.
It would be possible to configure a time-based retention policy per database (we store all our monitoring data in one database called telegraf).
This raises the question which durations we'd like to configure. It would also make sense to "downsample and retain data" to store data with less granularity over time.
I haven't found anything about disk-space-based retention policies within the documentation of InfluxDB so I assume this is not an option.
¹ When you create a database, InfluxDB creates a retention policy called autogen with an infinite duration…
- AC1: The directory monitor.qa.suse.de:/var/lib/influxdb has significantly less than the current 120GB
- AC2: Grafana is able to show at least 6 months worth of (optionally downsampled) data
- Ensure we have a backup before the process removes data
- Check what the current InfluxDB version 1.8 provides
- If we have downsampling, we should aim for at least 1 month of full-sampled data
- Check out previous attempts to export data
  - A dump is still lying around at /var/lib/influxdb/dumps/postgresql; it can be deleted if the disk space is needed
  - See #93922#note-10 but do not make the same mistake of using grep again for filtering
- Subject changed from Configure retention policy for monitoring data stored within InfluxDB size:M to Configure retention/downsampling policy for monitoring data stored within InfluxDB size:M
- Description updated (diff)
- Status changed from New to Workable
Fixes and additions as discussed verbally with the team in the joint estimation session
- Status changed from Blocked to Workable
- Assignee deleted (
I rejected #97583 as I checked on OBS and we do not have a more recent version than our production InfluxDB 1.8 anywhere on OBS. So this ticket is really about what the openSUSE packages currently provide, i.e. 1.8.6.
#12 Updated by mkittler about 1 month ago
- Status changed from Workable to Feedback
- Assignee set to mkittler
Judging by the documentation it looks like this will be rather coarse and static, and involve a lot of manual effort.
Here are some details. First, we'd need to change the default retention policy; e.g. the following command would cause new raw data to be deleted after half a year:
CREATE RETENTION POLICY "half_year" ON "telegraf" DURATION 180d REPLICATION 1 DEFAULT
Then we'd create a 2nd retention policy for "downsampled" data, e.g. to store it for approximately two years:
CREATE RETENTION POLICY "two_years" ON "telegraf" DURATION 720d REPLICATION 1
Retention policies seem to be per database and we store everything in telegraf, so that's as fine-grained as it gets.
Then we'd need to create a continuous query (a query which is performed periodically) to downsample raw data (stored within the default retention policy) and store the downsampled data within the 2nd retention policy:
CREATE CONTINUOUS QUERY "cq_downsample_some_fields" ON "telegraf" BEGIN SELECT mean("ratio_failed_p") AS "mean_ratio_failed_p", mean("incompletes_not_restarted_last_24h") AS "mean_incompletes_not_restarted_last_24h" INTO "two_years"."downsampled_postgresql" FROM "postgresql" GROUP BY time(1h) END
This query would run every hour and downsample the values with the field keys ratio_failed_p and incompletes_not_restarted_last_24h within the postgresql measurement by computing the average over one hour. The downsampled values would have different field keys. I was following the documentation here but maybe it makes more sense for us to use the same field keys (if that is possible) so the downsampled values would show up in the same way as the raw values.
This CQ is just an example covering two fields from one measurement. I suppose it would need to be extended to all the fields we want to cover. However, that doesn't sound very practical. I'm wondering whether there's really no better way than dealing with all fields individually, and what to do with field values where one cannot easily compute an average. Additionally, it only considers data from the last hour, which makes sense for downsampling new data as it comes in. The documentation does not mention how to deal with existing data, though, so I'm not sure how it would be handled. If we do nothing further, I suppose old data will either just be discarded with no downsampled version being stored longer (1) or the new default retention policy won't be applied to old data and it'll just be kept forever (2). Maybe it is possible to downsample existing data with a manual query (and drop the raw data manually).
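Since writing one aggregate clause per field by hand doesn't scale, the CQ statement could at least be generated. Here a sketch in Python; the helper name and the field-to-aggregate mapping are my own assumptions, only the statement shape follows the example above:

```python
def build_downsample_cq(cq_name, db, measurement, target_rp, fields, interval="1h"):
    """Generate a CREATE CONTINUOUS QUERY statement for downsampling.

    fields maps each field key to the aggregate function to use
    (e.g. "mean" for numbers, "last" for strings). The downsampled field
    keys get the aggregate as a prefix, like in the example above.
    """
    selects = ", ".join(
        f'{agg}("{field}") AS "{agg}_{field}"' for field, agg in fields.items()
    )
    return (
        f'CREATE CONTINUOUS QUERY "{cq_name}" ON "{db}" BEGIN '
        f'SELECT {selects} INTO "{target_rp}"."downsampled_{measurement}" '
        f'FROM "{measurement}" GROUP BY time({interval}) END'
    )

# reproduces the two-field example CQ from above
print(build_downsample_cq(
    "cq_downsample_some_fields", "telegraf", "postgresql", "two_years",
    {"ratio_failed_p": "mean", "incompletes_not_restarted_last_24h": "mean"},
))
```

With the full field list of a measurement this would at least avoid typing every clause manually, although it doesn't solve the per-field aggregate question.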
We could of course also only create a retention policy (simply discarding data exceeding a certain age, without the downsampling step).
By the way, creating a new default retention policy as suggested by the documentation would (to my understanding) not affect any existing data (which would still be kept under the current default retention policy "autogen"¹). We could of course just use e.g.
DELETE WHERE time < '2020-11-08' to delete everything older than a year (from now), but then we'd still need to repeat that process until all data from the "autogen" retention policy has been deleted.
Alternatively we could likely also alter the current "autogen" retention policy like this (as documented):
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 720d
I suppose then we wouldn't need to worry about removing existing data.
> SHOW RETENTION POLICIES ON "telegraf"
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true
Ok, so I understand it's not as simple as defining a default downsampling policy that affects all data, which is how I originally thought InfluxDB would work. Reading your comments and pages like https://www.influxdata.com/blog/downsampling-influxdb-v2-0/ and https://www.influxdata.com/influxdb-templates/downsampling-influxdb/ and https://hackernoon.com/influxdb-continuous-downsampling-optimize-your-tsdb-today-q3lm2fhg and https://influxdb.narkive.com/AQM2b071/automatically-downsample-all-data#post7 I think it might be better if we identify which measurements have the biggest impact and provide explicit, specific retention+downsampling per measurement. Can you identify according to the Pareto principle which measurements are the biggest offenders?
It is at least possible to determine the used disk space by measurement, e.g.:
martchus@openqa-monitor:~> sudo influx_inspect report-disk -detailed /var/lib/influxdb/data > influx-disk-usage.json
martchus@openqa-monitor:~> jq -r '["measurement","size_gb"], (.Measurement | sort_by(.size) | reverse | [select(.db == "telegraf")] | map(.measurement,.size / 1024 / 1024 / 1024)) | @tsv' influx-disk-usage.json
measurement                   size_gb
apache_log                    59.99271512031555
cpu                           46.17899715155363
procstat                      20.055298353545368
disk                          7.618690362200141
ping                          5.069371770136058
diskio                        3.353263510391116
mem                           2.891846708022058
net                           2.7090410608798265
system                        0.9844627007842064
apache                        0.5020766351372004
ntpq                          0.48354801069945097
kernel                        0.4041093336418271
procstat_lookup               0.3167126877233386
chrony                        0.30627269111573696
processes                     0.209840452298522
postgresql                    0.10544836893677711
http_response                 0.06108146067708731
swap                          0.059172035194933414
openqa_jobs_by_group          0.04418153967708349
openqa_jobs_by_worker         0.039369286969304085
openqa_jobs_by_arch           0.028374306857585907
openqa_minion_jobs            0.024388955906033516
openqa_minion_workers         0.015562820248305798
openqa_jobs                   0.014018760062754154
systemd_failed                0.005923120304942131
lab_temperature               0.00429917685687542
httpjson_worker_minion_stats  0.0014895927160978317
my_log                        0.00015092454850673676
systemd_failed_test           8.108559995889664e-05
telegraf                      2.9811635613441467e-06
openqa-websockets             2.8098002076148987e-06
openqa-scheduler              2.5257468223571777e-06
cron                          2.4391338229179382e-06
smb                           2.387911081314087e-06
rsyncd                        2.343207597732544e-06
openqa-livehandler            2.3096799850463867e-06
salt-minion                   2.255663275718689e-06
salt-master                   2.255663275718689e-06
vsftpd                        2.2351741790771484e-06
openqa-webui                  1.8579885363578796e-06
openqa-gru                    1.8207356333732605e-06
apache2                       1.6884878277778625e-06
postfix                       1.6335397958755493e-06
sshd                          1.5506520867347717e-06
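For follow-up analysis the same Pareto-style ranking can also be done without jq. A Python sketch; the embedded JSON is a made-up stand-in for influx-disk-usage.json, only the key names (.Measurement, .db, .measurement, .size) are taken from the jq filter above:

```python
import json

# Minimal stand-in for influx-disk-usage.json as produced by
# `influx_inspect report-disk -detailed`; values are illustrative.
report = json.loads("""
{"Measurement": [
  {"db": "telegraf", "measurement": "cpu", "size": 49580040000},
  {"db": "telegraf", "measurement": "apache_log", "size": 64417000000},
  {"db": "other", "measurement": "ignored", "size": 999}
]}
""")

def top_measurements(report, db="telegraf"):
    """Return (measurement, size_gb) pairs for one database, biggest first."""
    rows = [
        (m["measurement"], m["size"] / 1024**3)
        for m in report["Measurement"]
        if m["db"] == db
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

for name, gb in top_measurements(report):
    print(f"{name}\t{gb:.2f}")
```

This makes it easy to e.g. compute which measurements cover 80 % of the total size before deciding where downsampling pays off.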
So I could try it for the fields of the top measurements. Note that some of the field types are string, so computing an average like in the previously mentioned example isn't possible. It also wouldn't make much sense for some fields with number types, e.g. http_version¹. I suppose I'll just use last in such cases.
> show field keys from apache_log
name: apache_log
fieldKey         fieldType
--------         ---------
agent            string
auth             string
client_ip        string
http_version     float
ident            string
referrer         string
request          string
resp_bytes       integer
response_time_us integer
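Picking the aggregate per field could be mechanized from that SHOW FIELD KEYS output. A sketch (the helper and its overrides mechanism are my own idea) that defaults to mean for numeric fields and last for everything else, with an override for cases like http_version where a mean is meaningless:

```python
def aggregation_plan(fields, overrides=None):
    """Map each field key to a downsampling aggregate.

    fields maps field key -> InfluxQL field type (as shown by SHOW FIELD KEYS);
    overrides forces a specific aggregate for fields where the default
    choice doesn't make sense.
    """
    overrides = overrides or {}
    return {
        field: overrides.get(
            field, "mean" if ftype in ("integer", "float") else "last")
        for field, ftype in fields.items()
    }

# field types as reported by SHOW FIELD KEYS FROM apache_log above
apache_log_fields = {
    "agent": "string", "auth": "string", "client_ip": "string",
    "http_version": "float", "ident": "string", "referrer": "string",
    "request": "string", "resp_bytes": "integer", "response_time_us": "integer",
}
print(aggregation_plan(apache_log_fields, overrides={"http_version": "last"}))
```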
The default retention policy can only be set on database level; at least I couldn't find a way to set the default retention policy per measurement. So I suppose I cannot follow the documentation examples when only covering a certain measurement, as autogen should still be the default retention policy in general. So I'd need to find a way to move (and not just copy) raw data from one retention policy to another via a continuous query. It looks like this isn't possible, though: all commands/examples I've found only copy the data (and rely on the retention period for the cleanup).
The actual downsampling would look like this, by the way:
CREATE CONTINUOUS QUERY "cq_downsample_apache_log" ON "telegraf" BEGIN SELECT last("agent") AS "agent", last("auth") AS "auth", last("client_ip") AS "client_ip", last("http_version") AS "http_version", last("ident") AS "ident", last("referrer") AS "referrer", last("request") AS "request", mean("resp_bytes") AS "resp_bytes", mean("response_time_us") AS "response_time_us" INTO "two_years"."downsampled_apache_log" FROM "half_year"."apache_log" GROUP BY time(1h) END
I've tested the query (the SELECT … INTO part with some time constraint like
WHERE time >= '2021-11-22T00:00:00Z') and it worked. The same SELECT … INTO could also be run manually to cover existing data, although one might need to split the query into different time boundaries (see https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/#rename-a-database).
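Splitting such a manual backfill into time boundaries could be scripted. A sketch; the trimmed query template, the 30-day window size and the concrete dates are just assumptions for illustration:

```python
from datetime import datetime, timedelta

# Trimmed-down version of the inner SELECT ... INTO from the CQ above,
# with placeholders for the per-chunk time bounds.
TEMPLATE = (
    'SELECT mean("resp_bytes") AS "resp_bytes" '
    'INTO "two_years"."downsampled_apache_log" FROM "half_year"."apache_log" '
    "WHERE time >= '{start}' AND time < '{end}' GROUP BY time(1h)"
)

def backfill_queries(template, start, end, window=timedelta(days=30)):
    """Emit the backfill query once per time window so each run stays small."""
    queries = []
    current = start
    while current < end:
        upper = min(current + window, end)
        queries.append(template.format(
            start=current.strftime("%Y-%m-%dT%H:%M:%SZ"),
            end=upper.strftime("%Y-%m-%dT%H:%M:%SZ")))
        current = upper
    return queries

for q in backfill_queries(TEMPLATE, datetime(2021, 9, 1), datetime(2021, 11, 22)):
    print(q)
```

The generated statements could then be fed to the influx CLI one by one, keeping each backfill run bounded.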
I've now changed the default retention policy to delete everything older than 2 years. I also immediately deleted everything older than 2 years (via
delete where time < '2019-11-22').
/var/lib/influxdb shrunk from 185 GiB to 152 GiB.
I've also deleted all data associated with the apache_log measurement older than 1 year (via delete from "apache_log" where time < '2020-11-22'), which shrunk it to
So in the file system the effect isn't significantly visible. However,
report-disk shows the savings more clearly:
measurement                   size_gb
apache_log                    32.94414313044399
cpu                           28.99280675780028
procstat                      18.21759712137282
disk                          3.7574257869273424
ping                          3.290378055535257
net                           2.468618060462177
diskio                        1.9918824285268784
mem                           1.8506945176050067
system                        0.5679652551189065
apache                        0.45197963155806065
ntpq                          0.4201518790796399
chrony                        0.3121297275647521
procstat_lookup               0.30364393070340157
kernel                        0.2221601502969861
processes                     0.13334486354142427
postgresql                    0.10715502314269543
swap                          0.04125824477523565
http_response                 0.03266702499240637
openqa_jobs_by_group          0.02366778813302517
openqa_minion_jobs            0.022559572011232376
openqa_jobs_by_worker         0.020542572252452374
openqa_jobs_by_arch           0.0148719921708107
openqa_minion_workers         0.014549209736287594
openqa_jobs                   0.006770698353648186
systemd_failed                0.005970432423055172
lab_temperature               0.0044286223128438
httpjson_worker_minion_stats  0.0014895927160978317
systemd_failed_test           8.223019540309906e-05
Not sure how to tell InfluxDB to free up space in the file system. Maybe the effect would have been more visible with a smaller shard duration.
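My understanding (an assumption on my side, based on InfluxDB 1.x storing data in shard groups): deleted points are only tombstoned at first, and disk space is mainly reclaimed when a whole shard group falls outside the retention period or gets compacted. With the default 168h shard group duration seen above, space would be freed in roughly week-sized chunks; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope: how many shard groups a retention period spans.
# 168h is the default shardGroupDuration shown in the outputs above.
shard_group_hours = 168
retention_days = 720

shard_groups = retention_days * 24 / shard_group_hours
print(f"{shard_groups:.1f} shard groups of {shard_group_hours / 24:.0f} days each")
```

So a 720d retention would consist of roughly 103 weekly shards, and only whole expired shards free space at once, which would explain why a DELETE doesn't immediately shrink the directory.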
Ok. To fulfill AC1 what you could do is compress /var/lib/influxdb/dumps/postgresql, move it elsewhere, or delete it. I am trying my luck with compression right now: I called
nice -n 1 xz -3 --threads=0 postgresql in a root screen session on openqa-monitor.qa. As that file is text with a lot of repetition, I guess compression should be quite efficient.
Compression finished. The file is down to 381 MiB and usage in /var/lib/influxdb is 101G, so AC1 is fulfilled :)
I recommend you either follow up with downsampling policies for specific measurements or record the individual ideas for improvement in separate tickets if you want to get rid of the work for now :)
- Status changed from Feedback to Resolved
Sounds good. These files are likely from a previous attempt of me to fix another problem (some type conversion problem). So I suppose we could also just remove these files (if we need the disk space).
I've created #103380 as a follow-up, but obviously I'm out of ideas so I couldn't record any. Of course I noted that one should look into my previous attempts from this ticket.