action #100859

investigate how to optimize /srv data utilization on OSD size:S

Added by nicksinger about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2021-10-12
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Over 80% of the space on /srv on OSD is used up. Most of this data is used by our PostgreSQL database.
I raised this concern in Slack where some possible reasons were discussed. One suggestion is to figure out why PostgreSQL uses so much data. mkittler mentioned that a fresh database import lowers disk space consumption drastically.

Also see poo#89821 for some history about the alert itself.

Acceptance Criteria

AC1: Alert does not trigger any longer
AC2: Understand why our production database uses the space it uses

Suggestions


Related issues

Related to openQA Infrastructure - action #89821: alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages) - Resolved - 2021-03-10

Related to openQA Project - action #101283: Retry if webUI times out connection yielding 408 response "Request timeout" - New - 2021-10-21

Copied to openQA Infrastructure - action #100976: specific alert about the size of our database, maybe even database tables? - New - 2021-10-12

Copied to openQA Infrastructure - action #100985: Come up with a way to regularly check job group configs for outliers and misconfiguration, e.g. overly long result retention periods - New - 2021-10-12

History

#1 Updated by okurz about 2 months ago

  • Description updated (diff)
  • Target version set to Ready

#2 Updated by okurz about 1 month ago

  • Related to action #89821: alert: PROBLEM Service Alert: openqa.suse.de/fs_/srv is WARNING (flaky, partial recovery with OK messages) added

#3 Updated by mkittler about 1 month ago

  • Subject changed from investigate how to optimize /srv data utilization on OSD to investigate how to optimize /srv data utilization on OSD size:S
  • Status changed from New to Workable

#4 Updated by okurz about 1 month ago

  • Copied to action #100976: specific alert about the size of our database, maybe even database tables? added

#5 Updated by okurz about 1 month ago

  • Copied to action #100985: Come up with a way to regularly check job group configs for outliers and misconfiguration, e.g. overly long result retention periods added

#6 Updated by mkittler about 1 month ago

  • Assignee set to mkittler

#7 Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback

The database has definitely grown from 54.0 GiB to 71.4 GiB (a factor of 1.322) compared to my last check 7 months ago (see #89821#note-4):

--- /srv
   71,4 GiB [##########] /PSQL
    5,6 GiB [          ] /log
    1,2 GiB [          ]  homes.img
   28,9 MiB [          ] /salt
   12,0 MiB [          ] /pillar
    4,0 KiB [          ] /reactor
    0,0   B [          ] /www
e   0,0   B [          ] /tftpboot
e   0,0   B [          ] /svn
e   0,0   B [          ] /spm
e   0,0   B [          ] /ftp
e   0,0   B [          ] /backup

The size used by logs and home directories even slightly decreased.


Similar to my last check, a lot of the data is consumed by indexes:

openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio          
-------------+-------------+------------------------
 19 GB       | 52 GB       | 0.35656535508820541697

Note that a local database import takes much less space even though we have autovacuum enabled on OSD (see select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables;).

Here are figures from an older import (2021-09-08) which has only 568383 jobs:

openqa-osd=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio          
-------------+-------------+------------------------
 13 GB       | 17 GB       | 0.79865433261186920688
(1 Zeile)

So, in contrast to an almost fresh import, it is the indexes that make the biggest difference.

By the way, OSD currently has 792339 jobs, so in the last month the number of jobs increased by a factor of 1.394 (simply comparing with my local database dump). Note that this is close to the database growth factor mentioned before, although that factor covers the last 7 months.
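The growth factors mentioned above can be double-checked with a quick back-of-the-envelope calculation; all figures are taken from this ticket:

```python
# Database size: 54.0 GiB seven months ago (#89821#note-4) vs. 71.4 GiB now
size_factor = 71.4 / 54.0
print(f"database size factor: {size_factor:.3f}")  # 1.322

# Job count: 568383 jobs in the 2021-09-08 dump vs. 792339 jobs now
job_factor = 792339 / 568383
print(f"job count factor: {job_factor:.3f}")  # 1.394
```

So job count and database size grew at a roughly similar rate, but over very different time spans.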


That's how it looks in detail:

openqa=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes 
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings                          | 22 GB                     | 17 GB                |            23407943680
 job_modules                           | 16 GB                     | 12 GB                |            17496236032
 screenshot_links                      | 16 GB                     | 12 GB                |            17401176064
 screenshots                           | 11 GB                     | 7314 MB              |            11747926016
 jobs                                  | 3392 MB                   | 2826 MB              |             3557269504
 jobs_assets                           | 796 MB                    | 619 MB               |              834920448
 audit_events                          | 641 MB                    | 180 MB               |              672374784
 scheduled_products                    | 619 MB                    | 65 MB                |              649363456
 job_dependencies                      | 166 MB                    | 134 MB               |              174505984
 needles                               | 102 MB                    | 81 MB                |              106872832
 comments                              | 59 MB                     | 33 MB                |               62283776
 gru_tasks                             | 21 MB                     | 19 MB                |               21757952
 minion_jobs                           | 18 MB                     | 8920 kB              |               18718720
 assets                                | 17 MB                     | 8232 kB              |               17563648
 job_template_settings                 | 16 MB                     | 11 MB                |               16261120
 job_templates                         | 8040 kB                   | 6392 kB              |                8232960
 workers                               | 7216 kB                   | 2136 kB              |                7389184
 test_suite_settings                   | 6792 kB                   | 3848 kB              |                6955008
 worker_properties                     | 2560 kB                   | 1368 kB              |                2621440
 product_settings                      | 2392 kB                   | 1232 kB              |                2449408
 job_groups                            | 1520 kB                   | 168 kB               |                1556480
 bugs                                  | 1272 kB                   | 384 kB               |                1302528
 test_suites                           | 928 kB                    | 424 kB               |                 950272
 job_locks                             | 672 kB                    | 288 kB               |                 688128
 products                              | 320 kB                    | 160 kB               |                 327680
 job_networks                          | 200 kB                    | 144 kB               |                 204800
 users                                 | 192 kB                    | 80 kB                |                 196608
 developer_sessions                    | 176 kB                    | 88 kB                |                 180224
 api_keys                              | 112 kB                    | 48 kB                |                 114688
 minion_locks                          | 104 kB                    | 48 kB                |                 106496
 dbix_class_deploymenthandler_versions | 104 kB                    | 32 kB                |                 106496
 machine_settings                      | 96 kB                     | 0 bytes              |                  98304
 machines                              | 88 kB                     | 32 kB                |                  90112
 gru_dependencies                      | 88 kB                     | 48 kB                |                  90112
 job_group_parents                     | 80 kB                     | 32 kB                |                  81920
 minion_workers                        | 80 kB                     | 24 kB                |                  81920
 needle_dirs                           | 80 kB                     | 32 kB                |                  81920
 secrets                               | 80 kB                     | 32 kB                |                  81920
 mojo_migrations                       | 64 kB                     | 16 kB                |                  65536
 pgbench_history                       | 40 kB                     | 0 bytes              |                  40960
(40 Zeilen)

Long story short: Job-related tables are the problem.

By the way, in my local dump from 2021-09-08 pg_size_pretty doesn't even use GB:

openqa-osd=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes 
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings                          | 9936 MB                   | 6486 MB              |            10418798592
 screenshot_links                      | 6847 MB                   | 3789 MB              |             7179689984
 screenshots                           | 6491 MB                   | 3187 MB              |             6806061056
 job_modules                           | 5650 MB                   | 2990 MB              |             5924634624
 scheduled_products                    | 485 MB                    | 45 MB                |              508469248
 jobs                                  | 368 MB                    | 195 MB               |              385761280
 audit_events                          | 313 MB                    | 52 MB                |              328695808
…

coolo mentioned in the chat that he tried a few things. All the OSD figures mentioned so far have been collected after that.

Vacuum wasn't taking long and didn't help

we just have 840K jobs alive - this is just insane
63788 alone in "Maintenance: SLE 15 GA HPC Incidents"

I limited it to 21 days now - this should remove the bulk submission from September. But I guess we should align all maintenance incident groups that way

This freed 3GB! Just the HPC Incidents

The full thread is here: https://suse.slack.com/archives/C02AJ1E568M/p1634030652186600

There's also another thread: https://suse.slack.com/archives/C02AJ1E568M/p1634206976332400

I ran vacuum screenshots - took <5 minutes

Before:
bugs                                 | 9568 kB       |               9797632
After:
bugs                                 | 1240 kB       |               1269760

This explanation should fulfill AC2.

#8 Updated by mkittler about 1 month ago

Note that AC1 is technically also fulfilled because the threshold in our monitoring for /srv is 90 %. I assume nicksinger is referring to a different alert, e.g. from infra. The current usage is still slightly over 80 %.

#9 Updated by mkittler about 1 month ago

In conclusion I'd say we can either reduce the number of jobs we store or we demand more disk space.

Alternatively we could also try to vacuum more explicitly despite the fact that autovacuum should already cover us. I'm not sure whether we'll actually be able to free much disk space this way and whether it is practical to do for the big tables. I'll give https://www.postgresql.org/docs/12/routine-vacuuming.html a read.

#10 Updated by okurz about 1 month ago

mkittler wrote:

In conclusion I'd say we can either reduce the number of jobs we store or we demand more disk space.

Alternatively we could also try to vacuum more explicitly despite the fact that autovacuum should already cover us

coolo ran a manual vacuum full for the "top 3 - top 8 tables". screenshot_links went from 11G to 3G and took 4 minutes (https://suse.slack.com/archives/C02AJ1E568M/p1634243304354100?thread_ts=1634030652.186600&cid=C02AJ1E568M). Now we use 62% (62G), so increasing the disk size is not necessary right now either.

If we have an Icinga alert then its threshold should be higher than our Grafana alert threshold.
For other partitions we already adjusted the limits and made them more consistent, mostly to a 92% warning and 95% critical limit: https://gitlab.suse.de/OPS-Service/monitoring/-/blob/master/check_mk/nue-cmk/main.mk#L51

https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de shows an overview; there are two services, fs_/srv and fs_/space-slow, where I think we should apply those thresholds as well:

Created https://gitlab.suse.de/OPS-Service/monitoring/-/merge_requests/20 for that. The strange thing is that #89821 referenced https://infra.nue.suse.com/SelfService/Display.html?id=186868 which should have done that already, but I don't see it in git nor in https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=openqa.suse.de

#11 Updated by mkittler about 1 month ago

When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert nicksinger saw comes from.

It is good to know that full vacuuming helps, in particular with the indexes:

openqa=> select pg_size_pretty(sum(pg_table_size(quote_ident(table_name)))) as table_sizes, pg_size_pretty(sum(pg_indexes_size(quote_ident(table_name)))) as index_sizes, (sum(pg_table_size(quote_ident(table_name))) / sum(pg_indexes_size(quote_ident(table_name)))) as ratio from information_schema.tables where table_schema = 'public';
 table_sizes | index_sizes |         ratio          
-------------+-------------+------------------------
 18 GB       | 38 GB       | 0.46636455882797356600
(1 Zeile)
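For comparison, here are the table-to-index size ratios of all three measurements in this ticket side by side. This is a quick sketch: the GB figures come from pg_size_pretty and are rounded, so the recomputed ratios only approximate the exact values psql reported.

```python
# Table size vs. index size across the three measurements in this ticket.
# pg_size_pretty rounds to whole GB, so these ratios are approximate.
measurements = [
    # (label,                     table GB, index GB, exact ratio reported)
    ("bloated OSD",                     19, 52, 0.357),
    ("fresh import (2021-09-08)",       13, 17, 0.799),
    ("after partial VACUUM FULL",       18, 38, 0.466),
]
for label, tables_gb, indexes_gb, exact in measurements:
    print(f"{label}: ~{tables_gb / indexes_gb:.2f} (psql reported {exact})")
```

A low ratio means the indexes dominate the disk usage; full vacuuming pushes the ratio back toward the fresh-import value.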

coolo said he did not run the full vacuuming on the two biggest tables. So I assume job_modules and job_settings are still left to optimize.

the screenshot_links table was 11G before and is 3G after - and it took ~5 minutes to compact

So job_modules will take about 7.3 minutes and job_settings 10 minutes (if it scales linearly). That will possibly cause downtime which will lead to incomplete jobs.
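The linear extrapolation above can be sketched as follows (a rough estimate only, scaling coolo's screenshot_links data point by the pre-vacuum total relation sizes from the table listing earlier in this ticket):

```python
# Naive linear extrapolation of VACUUM FULL duration based on coolo's
# data point: the 11 GB screenshot_links table compacted in ~5 minutes.
ref_size_gb, ref_minutes = 11, 5

# Pre-vacuum total relation sizes (rounded GB) of the two remaining tables
for table, size_gb in [("job_modules", 16), ("job_settings", 22)]:
    estimate = ref_minutes * size_gb / ref_size_gb
    print(f"{table}: ~{estimate:.1f} min")  # 7.3 and 10.0 min respectively
```

As it turned out in #13, the actual elapsed times were only 50.54 s and 91.57 s, so this linear estimate was very pessimistic.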

#12 Updated by nicksinger about 1 month ago

mkittler wrote:

When I look at thruk for fs_/srv and fs_/space-slow it looks like the alert is at 95 % (by dividing the size by "critical at … GiB"). So I'm wondering where the alert nicksinger saw comes from.

Nah, we never saw a critical at https://nagios-devel.suse.de/pnp4nagios/graph?host=openqa.suse.de&srv=fs__srv&view=4&theme=smoothness but rather a "warning", which was reason enough for me to open this ticket. Acting only on criticals can sometimes be a bit late if more investigation is needed. But indeed, we're clearly under our warning threshold again.

#13 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

I vacuumed the two remaining tables yesterday evening and it helped:

openqa=> VACUUM FULL VERBOSE job_modules;
INFO:  vacuuming "public.job_modules"
INFO:  "job_modules": found 382300 removable, 28551735 nonremovable row versions in 529816 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 31.22 s, system: 9.92 s, elapsed: 50.54 s.

openqa=> VACUUM FULL VERBOSE job_settings;
INFO:  vacuuming "public.job_settings"
INFO:  "job_settings": found 375941 removable, 39664991 nonremovable row versions in 666744 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 35.16 s, system: 14.95 s, elapsed: 91.57 s.

openqa=> select table_name, pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) as size_including_indexes_gb, pg_size_pretty(pg_indexes_size(quote_ident(table_name))) as size_only_indexes_gb, pg_total_relation_size(quote_ident(table_name)) as size_including_indexes from information_schema.tables where table_schema = 'public' order by size_including_indexes desc;
              table_name               | size_including_indexes_gb | size_only_indexes_gb | size_including_indexes
---------------------------------------+---------------------------+----------------------+------------------------
 job_settings                          | 11 GB                     | 7610 MB              |            12255084544
 screenshot_links                      | 8595 MB                   | 4816 MB              |             9012715520
 job_modules                           | 7476 MB                   | 3946 MB              |             7838711808
 screenshots                           | 6612 MB                   | 3184 MB              |             6932815872
 jobs                                  | 853 MB                    | 526 MB               |              894287872
 scheduled_products                    | 611 MB                    | 53 MB                |              640925696
 audit_events                          | 538 MB                    | 92 MB                |              563773440
 jobs_assets                           | 371 MB                    | 193 MB               |              389275648
 needles                               | 102 MB                    | 81 MB                |              106905600
 job_dependencies                      | 85 MB                     | 57 MB                |               88604672
 comments                              | 64 MB                     | 34 MB                |               66936832
 workers                               | 41 MB                     | 10 MB                |               43474944
 minion_jobs                           | 25 MB                     | 15 MB                |               25690112
 assets                                | 23 MB                     | 12 MB                |               24035328
 gru_tasks                             | 21 MB                     | 19 MB                |               21757952
 job_template_settings                 | 16 MB                     | 11 MB                |               16261120
 job_templates                         | 8040 kB                   | 6392 kB              |                8232960
 test_suite_settings                   | 6792 kB                   | 3848 kB              |                6955008
 worker_properties                     | 2584 kB                   | 1392 kB              |                2646016
 product_settings                      | 2392 kB                   | 1232 kB              |                2449408
 job_groups                            | 1520 kB                   | 168 kB               |                1556480
 bugs                                  | 1480 kB                   | 424 kB               |                1515520
 test_suites                           | 928 kB                    | 424 kB               |                 950272
 job_locks                             | 896 kB                    | 592 kB               |                 917504
 products                              | 320 kB                    | 160 kB               |                 327680
 minion_locks                          | 272 kB                    | 176 kB               |                 278528
 job_networks                          | 208 kB                    | 152 kB               |                 212992
 users                                 | 192 kB                    | 80 kB                |                 196608
 developer_sessions                    | 176 kB                    | 88 kB                |                 180224
 api_keys                              | 112 kB                    | 48 kB                |                 114688
 dbix_class_deploymenthandler_versions | 104 kB                    | 32 kB                |                 106496
 machine_settings                      | 96 kB                     | 0 bytes              |                  98304
 gru_dependencies                      | 88 kB                     | 48 kB                |                  90112
 machines                              | 88 kB                     | 32 kB                |                  90112
 job_group_parents                     | 80 kB                     | 32 kB                |                  81920
 secrets                               | 80 kB                     | 32 kB                |                  81920
 needle_dirs                           | 80 kB                     | 32 kB                |                  81920
 minion_workers                        | 80 kB                     | 24 kB                |                  81920
 mojo_migrations                       | 64 kB                     | 16 kB                |                  65536
 pgbench_history                       | 40 kB                     | 0 bytes              |                  40960

/srv is back at 43.5 % so I suppose we can close the ticket.

I've now been restarting all jobs that incompleted due to an API failure since I started the vacuuming, because this time it caused several incompletes (though not enough to hit the alert threshold).

#14 Updated by okurz about 1 month ago

  • Related to action #101283: Retry if webUI times out connection yielding 408 response "Request timeout" added

#15 Updated by okurz about 1 month ago
