action #100859
investigate how to optimize /srv data utilization on OSD size:S
0%
Description
Motivation
Over 80% of the space of /srv on OSD is used up. Most of this data is used by our PostgreSQL database.
I raised this concern in Slack where some possible causes were discussed; one suggestion was to figure out why PostgreSQL uses so much space. mkittler mentioned that a fresh database import lowers disk space consumption drastically.
Also see poo#89821 for some history about the alert itself.
Acceptance Criteria
AC1: Alert does not trigger any longer
AC2: Understand why our production database uses the space it uses
Suggestions
- Enlarge the partition by opening an eng-infra ticket and asking for some more space for /dev/vdb
- Figure out whether the disk utilization of our database can be optimized
- Try whether the disk utilization can be reduced, e.g. by running a PostgreSQL vacuum (see the sketch after this list)
- See if auto vacuum can be configured or if its thresholds can be lowered (https://suse.slack.com/archives/C02AJ1E568M/p1634033225193500?thread_ts=1634030652.186600&cid=C02AJ1E568M)
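A minimal sketch of what the last two suggestions could look like, assuming direct psql access to the openqa database; the table name job_settings and the 0.01 scale factor are only illustrative, not values anyone has settled on:

-- Reclaim dead tuples manually and refresh planner statistics for one table:
VACUUM (VERBOSE, ANALYZE) job_settings;

-- Let autovacuum kick in after roughly 1 % of the table changed instead of the 20 % default
-- (0.01 is an illustrative value, not a tested recommendation):
ALTER TABLE job_settings SET (autovacuum_vacuum_scale_factor = 0.01);

-- Check when (auto)vacuum last ran per table:
SELECT relname, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY last_autovacuum NULLS FIRST;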
Related issues
History
#1
Updated by okurz over 1 year ago
- Description updated (diff)
- Target version set to Ready
#2
Updated by okurz over 1 year ago
- Related to action #89821: alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages) added
#3
Updated by mkittler over 1 year ago
- Subject changed from investigate how to optimize /srv data utilization on OSD to investigate how to optimize /srv data utilization on OSD size:S
- Status changed from New to Workable
#4
Updated by okurz over 1 year ago
- Copied to action #100976: specific alert about the size of our database, maybe even database tables? added
#5
Updated by okurz over 1 year ago
- Copied to action #100985: Come up with a way to regularly check job group configs for outliers and misconfiguration, e.g. overly long result retention periods added
#6
Updated by mkittler over 1 year ago
- Assignee set to mkittler
#7
Updated by mkittler over 1 year ago
- Status changed from Workable to Feedback
The database has definitely grown from 54.0 GiB to 71.4 GiB (a factor of 1.322) since my last check 7 months ago (see #89821#note-4):
--- /srv
   71,4 GiB [##########] /PSQL
    5,6 GiB [          ] /log
    1,2 GiB [          ]  homes.img
   28,9 MiB [          ] /salt
   12,0 MiB [          ] /pillar
    4,0 KiB [          ] /reactor
    0,0   B [          ] /www
e   0,0   B [          ] /tftpboot
e   0,0   B [          ] /svn
e   0,0   B [          ] /spm
e   0,0   B [          ] /ftp
e   0,0   B [          ] /backup
The size used by logs and home directories even slightly decreased.
Similar to my last check, a lot of the data is consumed by indexes:
openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes,
                pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes,
                (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio
           from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio
-------------+-------------+------------------------
 19 GB       | 52 GB       | 0.35656535508820541697
Note that a local database import takes much less space although we have auto vacuum enabled on OSD (see select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;).
Here are figures from an older import (2021-09-08) which has only 568383 jobs:
openqa-osd=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes,
                    pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes,
                    (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio
               from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio
-------------+-------------+------------------------
 13 GB       | 17 GB       | 0.79865433261186920688
(1 row)
So in contrast to an almost fresh import, the indexes make the biggest difference.
By the way, OSD currently has 792339 jobs, so in the last month the number of jobs increased by a factor of 1.394 (I'm simply comparing with my local database dump). Note that this is close to the factor of the database growth mentioned before, although that factor covers the last 7 months.
That's how it looks in detail:
openqa=> select table_name,
                pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb,
                pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb,
                pg_total_relation_size(quote_ident(table_name)) as size_including_indexes
           from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings | 22 GB | 17 GB | 23407943680
 job_modules | 16 GB | 12 GB | 17496236032
 screenshot_links | 16 GB | 12 GB | 17401176064
 screenshots | 11 GB | 7314 MB | 11747926016
 jobs | 3392 MB | 2826 MB | 3557269504
 jobs_assets | 796 MB | 619 MB | 834920448
 audit_events | 641 MB | 180 MB | 672374784
 scheduled_products | 619 MB | 65 MB | 649363456
 job_dependencies | 166 MB | 134 MB | 174505984
 needles | 102 MB | 81 MB | 106872832
 comments | 59 MB | 33 MB | 62283776
 gru_tasks | 21 MB | 19 MB | 21757952
 minion_jobs | 18 MB | 8920 kB | 18718720
 assets | 17 MB | 8232 kB | 17563648
 job_template_settings | 16 MB | 11 MB | 16261120
 job_templates | 8040 kB | 6392 kB | 8232960
 workers | 7216 kB | 2136 kB | 7389184
 test_suite_settings | 6792 kB | 3848 kB | 6955008
 worker_properties | 2560 kB | 1368 kB | 2621440
 product_settings | 2392 kB | 1232 kB | 2449408
 job_groups | 1520 kB | 168 kB | 1556480
 bugs | 1272 kB | 384 kB | 1302528
 test_suites | 928 kB | 424 kB | 950272
 job_locks | 672 kB | 288 kB | 688128
 products | 320 kB | 160 kB | 327680
 job_networks | 200 kB | 144 kB | 204800
 users | 192 kB | 80 kB | 196608
 developer_sessions | 176 kB | 88 kB | 180224
 api_keys | 112 kB | 48 kB | 114688
 minion_locks | 104 kB | 48 kB | 106496
 dbix_class_deploymenthandler_versions | 104 kB | 32 kB | 106496
 machine_settings | 96 kB | 0 bytes | 98304
 machines | 88 kB | 32 kB | 90112
 gru_dependencies | 88 kB | 48 kB | 90112
 job_group_parents | 80 kB | 32 kB | 81920
 minion_workers | 80 kB | 24 kB | 81920
 needle_dirs | 80 kB | 32 kB | 81920
 secrets | 80 kB | 32 kB | 81920
 mojo_migrations | 64 kB | 16 kB | 65536
 pgbench_history | 40 kB | 0 bytes | 40960
(40 rows)
Long story short: Job-related tables are the problem.
By the way, in my local dump from 2021-09-08 pg_size_pretty doesn't even use GB:
openqa-osd=> select table_name,
                    pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb,
                    pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb,
                    pg_total_relation_size(quote_ident(table_name)) as size_including_indexes
               from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings | 9936 MB | 6486 MB | 10418798592
 screenshot_links | 6847 MB | 3789 MB | 7179689984
 screenshots | 6491 MB | 3187 MB | 6806061056
 job_modules | 5650 MB | 2990 MB | 5924634624
 scheduled_products | 485 MB | 45 MB | 508469248
 jobs | 368 MB | 195 MB | 385761280
 audit_events | 313 MB | 52 MB | 328695808
…
coolo mentioned in the chat that he tried a few things. All the OSD figures mentioned so far have been collected after that.
Vacuum wasn't taking long and didn't help
we just have 840K jobs alive - this is just insane
63788 alone in "Maintenance: SLE 15 GA HPC Incidents"
…
I limited it to 21 days now - this should remove the bulk submission from september. But I guess we should align all maintenance incident groups that way
…
This freed 3GB! Just the HPC Incidents
The full thread is here: https://suse.slack.com/archives/C02AJ1E568M/p1634030652186600
There's also another thread: https://suse.slack.com/archives/C02AJ1E568M/p1634206976332400
I ran vacuum screenshots - took <5 minutes
Before: bugs | 9568 kB | 9797632
After:  bugs | 1240 kB | 1269760
This explanation should fulfill AC2.
#8
Updated by mkittler over 1 year ago
Note that AC1 is technically also fulfilled because the threshold in our monitoring for /srv is 90 %. I assume nicksinger is mentioning a different alert, e.g. from infra. The current usage is still slightly over 80 %.
#9
Updated by mkittler over 1 year ago
In conclusion I'd say we can either reduce the number of jobs we store or ask for more disk space.
Alternatively, we could also try to vacuum more explicitly, despite the fact that autovacuum should already cover us. I'm not sure whether we'd actually be able to free much disk space this way or whether it is practical for the big tables. I'll give https://www.postgresql.org/docs/12/routine-vacuuming.html a read.
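For reference, a generic PostgreSQL sketch of what "vacuuming more explicitly" could mean (screenshot_links is only an example table here, not a command that has been run yet): a plain VACUUM merely marks dead rows as reusable inside the existing table files, while VACUUM FULL rewrites the table and actually returns space to the filesystem, at the cost of an exclusive lock.

-- Plain VACUUM: no exclusive lock; reclaimed space is reused internally but
-- usually not returned to the filesystem.
VACUUM (VERBOSE) screenshot_links;

-- VACUUM FULL: rewrites the table and its indexes and hands the freed space back
-- to the filesystem, but blocks reads and writes on the table while it runs.
VACUUM (FULL, VERBOSE) screenshot_links;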
#10
Updated by okurz over 1 year ago
mkittler wrote:
In conclusion I'd say we can either reduce the number of jobs we store or we demand more disk space.
Alternatively we could also try to vacuum more explicitly despite the fact that autovacuum should already cover us
coolo ran a manual vacuum full for the "top 3 - top 8 tables". screenshot_links went from 11G to 3G and it took 4 minutes (https://suse.slack.com/archives/C02AJ1E568M/p1634243304354100?thread_ts=1634030652.186600&cid=C02AJ1E568M). Now we use 62% (62G), so increasing the disk size is not necessary right now either.
If we have an icinga alert then that should be higher than our grafana alert threshold.
For other partitions we have already adjusted the limits and made them more consistent, mostly to a 92% warning and a 95% alert limit: https://gitlab.suse.de/OPS-Service/monitoring/-/blob/master/check_mk/nue-cmk/main.mk#L51
https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de shows an overview and there are two services, fs_/srv and fs_/space-slow, where I think we should also have those thresholds.
Created https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/20 for that. The strange thing is that #89821 referenced https://infra.nue.suse.com/SelfService/Display.html?id=186868 which should have done that already, but I don't see it in git and also not in https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de
#11
Updated by mkittler over 1 year ago
When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert nicksinger saw comes from.
It is good to know that full vacuuming helps, in particular with the indexes:
openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes,
                pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes,
                (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio
           from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio
-------------+-------------+------------------------
 18 GB       | 38 GB       | 0.46636455882797356600
(1 row)
coolo said he did not run the full vacuuming on the two biggest tables, so I assume job_modules and job_settings are still left to optimize.
the screenshot_links table was 11G before and is 3G after - and it took ~5 minutes to compact
So job_modules will take about 7.3 minutes and job_settings about 10 minutes (if it scales linearly). That will possibly cause a downtime which will lead to incomplete jobs.
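If we attempt that, the progress could be watched from a second psql session; a minimal sketch, assuming OSD runs PostgreSQL 12 or newer (the docs link above points to the version 12 manual), where VACUUM FULL reports into pg_stat_progress_cluster:

-- Assumption: PostgreSQL >= 12, where VACUUM FULL shows up in pg_stat_progress_cluster.
-- Run in a second session to see how far the table rewrite has progressed.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS percent_scanned
  FROM pg_stat_progress_cluster p
  JOIN pg_class c ON c.oid = p.relid
 WHERE p.command = 'VACUUM FULL';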
#12
Updated by nicksinger over 1 year ago
mkittler wrote:
When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert nicksinger saw comes from.
Nah, we never saw a critical at https://nagios-devel.suse.de/pnp4nagios/graph?host=openqa.suse.de&srv=fs__srv&view=4&theme=smoothness but rather a "warning", which was reason enough for me to open this ticket. Acting only on "criticals" can sometimes be a bit late if more investigation is needed. But indeed, we're clearly under our warning threshold again.
#13
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
I vacuumed the two remaining tables yesterday evening and it helped:
openqa=> VACUUM FULL VERBOSE job_modules;
INFO:  vacuuming "public.job_modules"
INFO:  "job_modules": found 382300 removable, 28551735 nonremovable row versions in 529816 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 31.22 s, system: 9.92 s, elapsed: 50.54 s.
openqa=> VACUUM FULL VERBOSE job_settings;
INFO:  vacuuming "public.job_settings"
INFO:  "job_settings": found 375941 removable, 39664991 nonremovable row versions in 666744 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 35.16 s, system: 14.95 s, elapsed: 91.57 s.
openqa=> select table_name,
                pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb,
                pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb,
                pg_total_relation_size(quote_ident(table_name)) as size_including_indexes
           from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings | 11 GB | 7610 MB | 12255084544
 screenshot_links | 8595 MB | 4816 MB | 9012715520
 job_modules | 7476 MB | 3946 MB | 7838711808
 screenshots | 6612 MB | 3184 MB | 6932815872
 jobs | 853 MB | 526 MB | 894287872
 scheduled_products | 611 MB | 53 MB | 640925696
 audit_events | 538 MB | 92 MB | 563773440
 jobs_assets | 371 MB | 193 MB | 389275648
 needles | 102 MB | 81 MB | 106905600
 job_dependencies | 85 MB | 57 MB | 88604672
 comments | 64 MB | 34 MB | 66936832
 workers | 41 MB | 10 MB | 43474944
 minion_jobs | 25 MB | 15 MB | 25690112
 assets | 23 MB | 12 MB | 24035328
 gru_tasks | 21 MB | 19 MB | 21757952
 job_template_settings | 16 MB | 11 MB | 16261120
 job_templates | 8040 kB | 6392 kB | 8232960
 test_suite_settings | 6792 kB | 3848 kB | 6955008
 worker_properties | 2584 kB | 1392 kB | 2646016
 product_settings | 2392 kB | 1232 kB | 2449408
 job_groups | 1520 kB | 168 kB | 1556480
 bugs | 1480 kB | 424 kB | 1515520
 test_suites | 928 kB | 424 kB | 950272
 job_locks | 896 kB | 592 kB | 917504
 products | 320 kB | 160 kB | 327680
 minion_locks | 272 kB | 176 kB | 278528
 job_networks | 208 kB | 152 kB | 212992
 users | 192 kB | 80 kB | 196608
 developer_sessions | 176 kB | 88 kB | 180224
 api_keys | 112 kB | 48 kB | 114688
 dbix_class_deploymenthandler_versions | 104 kB | 32 kB | 106496
 machine_settings | 96 kB | 0 bytes | 98304
 gru_dependencies | 88 kB | 48 kB | 90112
 machines | 88 kB | 32 kB | 90112
 job_group_parents | 80 kB | 32 kB | 81920
 secrets | 80 kB | 32 kB | 81920
 needle_dirs | 80 kB | 32 kB | 81920
 minion_workers | 80 kB | 24 kB | 81920
 mojo_migrations | 64 kB | 16 kB | 65536
 pgbench_history | 40 kB | 0 bytes | 40960
/srv is back at 43.5 % so I suppose we can close the ticket.
I have now been restarting all jobs which incompleted due to an API failure since I started the vacuuming, because this time it caused several incompletes (though not enough to hit the alert threshold).
#14
Updated by okurz over 1 year ago
- Related to action #101283: [easy][beginner] Retry if webUI times out connection yielding 408 response "Request timeout" added
#15
Updated by okurz over 1 year ago
Also, https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/20 has been merged and made effective now.