action #100859
closed: investigate how to optimize /srv data utilization on OSD size:S
Description
Motivation
Over 80% of the space of /srv on OSD is used up. Most of this data is used by our PostgreSQL database.
I raised this concern in Slack where some possible reasons were discussed. One of them is to figure out why PostgreSQL uses so much data. @mkittler mentioned that a fresh database import lowers disk space consumption drastically.
Also see poo#89821 for some history about the alert itself.
Acceptance Criteria
AC1: Alert does not trigger any longer
AC2: Understand why our production database uses the space it uses
Suggestions
- Enlarge the partition by opening an eng-infra ticket and asking for more space for /dev/vdb
- Figure out if the disk utilization of our database can be optimized
- Try if the disk utilization can be reduced, e.g. by running a PostgreSQL vacuum (see the sketch after this list)
- See if autovacuum can be configured or if its thresholds can be lowered (https://suse.slack.com/archives/C02AJ1E568M/p1634033225193500?thread_ts=1634030652.186600&cid=C02AJ1E568M)
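A minimal sketch of what the last two suggestions could look like in psql; job_settings is only used as an example target (it shows up as one of the biggest tables below) and the scale factor value is an illustrative assumption, not a recommendation:
-- reclaim dead tuples without an exclusive lock (safe to run while openQA is up)
VACUUM (VERBOSE, ANALYZE) job_settings;
-- check when autovacuum last ran and how many dead tuples remain per table
SELECT relname, last_autovacuum, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
-- lower the autovacuum threshold for a single large table
ALTER TABLE job_settings SET (autovacuum_vacuum_scale_factor = 0.01);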
Updated by okurz almost 3 years ago
- Description updated (diff)
- Target version set to Ready
Updated by okurz almost 3 years ago
- Related to action #89821: alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages) added
Updated by mkittler almost 3 years ago
- Subject changed from investigate how to optimize /srv data utilization on OSD to investigate how to optimize /srv data utilization on OSD size:S
- Status changed from New to Workable
Updated by okurz almost 3 years ago
- Copied to action #100976: specific alert about the size of our database, maybe even database tables? added
Updated by okurz almost 3 years ago
- Copied to action #100985: Come up with a way to regularly check job group configs for outliers and misconfiguration, e.g. overly long result retention periods added
Updated by mkittler almost 3 years ago
- Status changed from Workable to Feedback
The database has definitely grown from 54.0 GiB to 71.4 GiB (a factor of 1.322) since my last check 7 months ago (see #89821#note-4):
--- /srv
71.4 GiB [##########] /PSQL
5.6 GiB [ ] /log
1.2 GiB [ ] homes.img
28.9 MiB [ ] /salt
12.0 MiB [ ] /pillar
4.0 KiB [ ] /reactor
0.0 B [ ] /www
e 0.0 B [ ] /tftpboot
e 0.0 B [ ] /svn
e 0.0 B [ ] /spm
e 0.0 B [ ] /ftp
e 0.0 B [ ] /backup
The size used by logs and home directories even slightly decreased.
Similar to my last check, a lot of the data is consumed by indexes:
openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
table_sizes | index_sizes | ratio
-------------+-------------+------------------------
19 GB | 52 GB | 0.35656535508820541697
Note that a local database import takes much less space although we have autovacuum enabled on OSD (see select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;).
Here are figures from an older import (2021-09-08) which has only 568383 jobs:
openqa-osd=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
table_sizes | index_sizes | ratio
-------------+-------------+------------------------
13 GB | 17 GB | 0.79865433261186920688
(1 row)
So in contrast to an almost fresh import, the indexes make the most difference.
By the way, OSD currently has 792339 jobs, so in the last month the number of jobs increased by a factor of 1.394 (I'm simply comparing with my local database dump). Note that this is close to the factor of the database growth mentioned before, but that factor covers the last 7 months.
That's how it looks in detail:
openqa=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
table_name | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
job_settings | 22 GB | 17 GB | 23407943680
job_modules | 16 GB | 12 GB | 17496236032
screenshot_links | 16 GB | 12 GB | 17401176064
screenshots | 11 GB | 7314 MB | 11747926016
jobs | 3392 MB | 2826 MB | 3557269504
jobs_assets | 796 MB | 619 MB | 834920448
audit_events | 641 MB | 180 MB | 672374784
scheduled_products | 619 MB | 65 MB | 649363456
job_dependencies | 166 MB | 134 MB | 174505984
needles | 102 MB | 81 MB | 106872832
comments | 59 MB | 33 MB | 62283776
gru_tasks | 21 MB | 19 MB | 21757952
minion_jobs | 18 MB | 8920 kB | 18718720
assets | 17 MB | 8232 kB | 17563648
job_template_settings | 16 MB | 11 MB | 16261120
job_templates | 8040 kB | 6392 kB | 8232960
workers | 7216 kB | 2136 kB | 7389184
test_suite_settings | 6792 kB | 3848 kB | 6955008
worker_properties | 2560 kB | 1368 kB | 2621440
product_settings | 2392 kB | 1232 kB | 2449408
job_groups | 1520 kB | 168 kB | 1556480
bugs | 1272 kB | 384 kB | 1302528
test_suites | 928 kB | 424 kB | 950272
job_locks | 672 kB | 288 kB | 688128
products | 320 kB | 160 kB | 327680
job_networks | 200 kB | 144 kB | 204800
users | 192 kB | 80 kB | 196608
developer_sessions | 176 kB | 88 kB | 180224
api_keys | 112 kB | 48 kB | 114688
minion_locks | 104 kB | 48 kB | 106496
dbix_class_deploymenthandler_versions | 104 kB | 32 kB | 106496
machine_settings | 96 kB | 0 bytes | 98304
machines | 88 kB | 32 kB | 90112
gru_dependencies | 88 kB | 48 kB | 90112
job_group_parents | 80 kB | 32 kB | 81920
minion_workers | 80 kB | 24 kB | 81920
needle_dirs | 80 kB | 32 kB | 81920
secrets | 80 kB | 32 kB | 81920
mojo_migrations | 64 kB | 16 kB | 65536
pgbench_history | 40 kB | 0 bytes | 40960
(40 rows)
Long story short: Job-related tables are the problem.
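Since most of that space sits in the indexes of those job tables, here is a sketch (using the standard pg_stat_user_indexes view, not output collected on OSD) for drilling down into which individual indexes are the largest:
-- list the ten largest indexes together with their tables
SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
 ORDER BY pg_relation_size(indexrelid) DESC
 LIMIT 10;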
By the way, in my local dump from 2021-09-08 pg_size_pretty doesn't even use GB:
openqa-osd=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
table_name | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
job_settings | 9936 MB | 6486 MB | 10418798592
screenshot_links | 6847 MB | 3789 MB | 7179689984
screenshots | 6491 MB | 3187 MB | 6806061056
job_modules | 5650 MB | 2990 MB | 5924634624
scheduled_products | 485 MB | 45 MB | 508469248
jobs | 368 MB | 195 MB | 385761280
audit_events | 313 MB | 52 MB | 328695808
…
@coolo mentioned in the chat that he tried a few things. All the OSD figures mentioned so far have been collected after that.
Vacuum wasn't taking long and didn't help
we just have 840K jobs alive - this is just insane
63788 alone in "Maintenance: SLE 15 GA HPC Incidents"
…
I limited it to 21 days now - this should remove the bulk submission from september. But I guess we should align all maintenance incident groups that way
…
This freed 3GB! Just the HPC Incidents
The full thread is here: https://suse.slack.com/archives/C02AJ1E568M/p1634030652186600
There's also another thread: https://suse.slack.com/archives/C02AJ1E568M/p1634206976332400
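As a sketch, the per-group job counts quoted from the chat could presumably be reproduced with a query along these lines (assuming the jobs.group_id and job_groups.name columns of the openQA schema):
-- count the jobs currently kept per job group to spot groups with long retention
SELECT g.name, count(*) AS job_count
  FROM jobs j
  JOIN job_groups g ON j.group_id = g.id
 GROUP BY g.name
 ORDER BY job_count DESC
 LIMIT 10;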
I ran vacuum screenshots, which took less than 5 minutes.
Before:
bugs | 9568 kB | 9797632
After:
bugs | 1240 kB | 1269760
This explanation should fulfill AC2.
Updated by mkittler almost 3 years ago
Note that AC1 is technically also fulfilled because the threshold in our monitoring for /srv is 90 %. I assume @nicksinger is mentioning a different alert, e.g. from infra. The current usage is still slightly over 80 %.
Updated by mkittler almost 3 years ago
In conclusion I'd say we can either reduce the number of jobs we store or we demand more disk space.
Alternatively we could also try to vacuum more explicitly despite the fact that autovacuum should already cover us. I'm not sure whether we'll actually be able to free much disk space this way or whether it is practical to do for the big tables. I'll give https://www.postgresql.org/docs/12/routine-vacuuming.html a read.
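To judge whether more explicit vacuuming could actually free space, a sketch that checks how many dead tuples each table still carries (standard pg_stat_user_tables columns; I have not run this on OSD):
-- tables with many dead tuples are the most promising candidates for (full) vacuuming
SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 1) AS dead_pct
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;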
Updated by okurz almost 3 years ago
mkittler wrote:
In conclusion I'd say we can either reduce the number of jobs we store or we demand more disk space.
Alternatively we could also try to vacuum more explicitly despite the fact that autovacuum should already cover us
coolo ran a manual vacuum full for the "top 3 - top 8 tables". screenshot_links went from 11G to 3G and it took 4 minutes (https://suse.slack.com/archives/C02AJ1E568M/p1634243304354100?thread_ts=1634030652.186600&cid=C02AJ1E568M). Now we use 62% (62G), so increasing the disk size is also not necessary right now.
If we have an Icinga alert then its threshold should be higher than our Grafana alert threshold.
For other partitions we already have adjusted the limits and made them more consistent to mostly 92% warn and 95% alert limit: https://gitlab.suse.de/OPS-Service/monitoring/-/blob/master/check_mk/nue-cmk/main.mk#L51
https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de shows an overview and there are two services, fs_/srv and fs_/space-slow, where I think we should also have those thresholds.
Created https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/20 for that. The strange thing is that #89821 referenced https://infra.nue.suse.com/SelfService/Display.html?id=186868 which should have done that already, but I don't see it in git and also not in https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de
Updated by mkittler almost 3 years ago
When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert @nicksinger saw comes from.
It is good to know that full vacuuming helps, in particular with the indexes:
openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
table_sizes | index_sizes | ratio
-------------+-------------+------------------------
18 GB | 38 GB | 0.46636455882797356600
(1 row)
@coolo said he did not run the full vacuuming on the two biggest tables. So I assume job_modules and job_settings are still left to optimize.
the screenshot_links table was 11G before and is 3G after - and it took ~5 minutes to compact
So job_modules will take about 7.3 minutes and job_settings about 10 minutes if it scales linearly (5 min × 16 GB / 11 GB ≈ 7.3 min and 5 min × 22 GB / 11 GB = 10 min, using the total relation sizes listed above). That will possibly cause a downtime which will lead to incomplete jobs.
Updated by nicksinger almost 3 years ago
mkittler wrote:
When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert @nicksinger saw comes from.
Nah, we never saw a critical at https://nagios-devel.suse.de/pnp4nagios/graph?host=openqa.suse.de&srv=fs__srv&view=4&theme=smoothness but rather a "warning", which was reason enough for me to open this ticket. Acting only on "criticals" can sometimes be a bit late if more investigation is needed. But indeed, we're clearly under our warning threshold again.
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
I vacuumed the two remaining tables yesterday evening and it helped:
openqa=> VACUUM FULL VERBOSE job_modules;
INFO: vacuuming "public.job_modules"
INFO: "job_modules": found 382300 removable, 28551735 nonremovable row versions in 529816 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 31.22 s, system: 9.92 s, elapsed: 50.54 s.
openqa=> VACUUM FULL VERBOSE job_settings;
INFO: vacuuming "public.job_settings"
INFO: "job_settings": found 375941 removable, 39664991 nonremovable row versions in 666744 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 35.16 s, system: 14.95 s, elapsed: 91.57 s.
openqa=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
table_name | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
job_settings | 11 GB | 7610 MB | 12255084544
screenshot_links | 8595 MB | 4816 MB | 9012715520
job_modules | 7476 MB | 3946 MB | 7838711808
screenshots | 6612 MB | 3184 MB | 6932815872
jobs | 853 MB | 526 MB | 894287872
scheduled_products | 611 MB | 53 MB | 640925696
audit_events | 538 MB | 92 MB | 563773440
jobs_assets | 371 MB | 193 MB | 389275648
needles | 102 MB | 81 MB | 106905600
job_dependencies | 85 MB | 57 MB | 88604672
comments | 64 MB | 34 MB | 66936832
workers | 41 MB | 10 MB | 43474944
minion_jobs | 25 MB | 15 MB | 25690112
assets | 23 MB | 12 MB | 24035328
gru_tasks | 21 MB | 19 MB | 21757952
job_template_settings | 16 MB | 11 MB | 16261120
job_templates | 8040 kB | 6392 kB | 8232960
test_suite_settings | 6792 kB | 3848 kB | 6955008
worker_properties | 2584 kB | 1392 kB | 2646016
product_settings | 2392 kB | 1232 kB | 2449408
job_groups | 1520 kB | 168 kB | 1556480
bugs | 1480 kB | 424 kB | 1515520
test_suites | 928 kB | 424 kB | 950272
job_locks | 896 kB | 592 kB | 917504
products | 320 kB | 160 kB | 327680
minion_locks | 272 kB | 176 kB | 278528
job_networks | 208 kB | 152 kB | 212992
users | 192 kB | 80 kB | 196608
developer_sessions | 176 kB | 88 kB | 180224
api_keys | 112 kB | 48 kB | 114688
dbix_class_deploymenthandler_versions | 104 kB | 32 kB | 106496
machine_settings | 96 kB | 0 bytes | 98304
gru_dependencies | 88 kB | 48 kB | 90112
machines | 88 kB | 32 kB | 90112
job_group_parents | 80 kB | 32 kB | 81920
secrets | 80 kB | 32 kB | 81920
needle_dirs | 80 kB | 32 kB | 81920
minion_workers | 80 kB | 24 kB | 81920
mojo_migrations | 64 kB | 16 kB | 65536
pgbench_history | 40 kB | 0 bytes | 40960
/srv is back at 43.5 % so I suppose we can close the ticket.
I've now been restarting all jobs which incompleted due to an API failure since I started the vacuuming, because this time it caused several incompletes (though not enough to hit the alert threshold).
Updated by okurz almost 3 years ago
- Related to action #101283: [easy][beginner] Retry if webUI times out connection yielding 408 response "Request timeout" added
Updated by okurz almost 3 years ago
Also, https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/20 was merged and is now effective.