Project

General

Profile

Actions

action #121054

closed

bigint conversion fails due to idx_job_id_value_settings index being too wide size:S

Added by AdamWill about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-11-28
Due date:
% Done:

0%

Estimated time:

Description

When I update Fedora's staging openQA instance to a newer version of openQA with the bigint database conversion - https://progress.opensuse.org/issues/112265 - the conversion fails:

Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: failed to run SQL in /usr/share/openqa/script/../dbicdh/PostgreSQL/upgrade/94-95/001-auto.sql: DBIx::Class::DeploymentHandler::DeployMethod::SQL::Translator::try {...} (): DBI Exception: DBD::Pg::db do failed: ERROR:  index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_job_id_value_settings"
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: DETAIL:  Index row references tuple (149183,23) in relation "job_settings".
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: Consider a function index of an MD5 hash of the value, or use full text indexing. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->upgrade_single_step (attribute declared in /usr/share/perl5/vendor_perl/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]:  (running line 'ALTER TABLE job_settings ALTER COLUMN job_id TYPE bigint') at /usr/share/perl5/vendor_perl/DBIx/Class/DeploymentHandler/DeployMethod/SQL/Translator.pm line 263.
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa line 0
Nov 28 18:32:54 openqa-lab01.iad2.fedoraproject.org openqa-gru[1042531]: DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa line 0

I'm no database expert, but after a morning of googling, I think basically we're running into a true hard limit in postgres here. As it says, "Values larger than 1/3 of a buffer page cannot be indexed" (in a btree index - this is a limitation of btree indexes specifically).

I did check I can recreate the same index with current postgresql and the job_id still as 'integer' type, and it works:

openqa-stg=> \d job_settings
                                        Table "public.job_settings"
  Column   |            Type             | Collation | Nullable |                 Default                  
-----------+-----------------------------+-----------+----------+------------------------------------------
 id        | integer                     |           | not null | nextval('job_settings_id_seq'::regclass)
 key       | text                        |           | not null | 
 value     | text                        |           | not null | 
 job_id    | integer                     |           | not null | 
 t_created | timestamp without time zone |           | not null | 
 t_updated | timestamp without time zone |           | not null | 

openqa-stg=> CREATE INDEX idx_job_id_value_settings_test on job_settings (job_id, key, value);
CREATE INDEX
openqa-stg=> 

I guess changing the integer type requires the index to be recreated and makes the index entries larger, and now at least one of them is just too big, or something.

I suspect probably only a few rows are problematic here, and it's likely ones for tests of this update: https://bodhi.fedoraproject.org/updates/FEDORA-2022-23a0a34ea5 . As you can see at e.g. https://openqa.stg.fedoraproject.org/tests/2283227#settings , one of the settings for our update tests (ADVISORY_NVRS) is a list of all the packages in the update, and for an update with a lot of packages in it, that's pretty long. It seems likely it's the rows with that specific setting in it that are triggering this problem.

The indexes were added in 43bac479f37b64437e24e57a65a2ed88cd4f4dfb , which was a part of https://github.com/os-autoinst/openQA/pull/521 , but it seems it was "really" supposed to be part of https://github.com/os-autoinst/openQA/pull/534 (521 just happened to be touching the db schema at the same time so the index additions got rolled into it as well). That suggests the idea was to "Improve query_jobs performance", and coolo said the indexes were "needed" for that.

From that PR, it looks like the query we were trying to speed up is the one that is now query_for_settings in lib/Openqa/Schema/ResultSet/JobSettings.pm - or rather, queries which use that subquery, of which there are currently three, all of which do something similar to this:

    my $subquery = $schema->resultset('JobSettings')->query_for_settings(\%precond);
    $cond{'me.id'} = {-in => $subquery->get_column('job_id')->as_query};

so they basically wind up involving the job_id, key and value, just as the index expects.

It looks to me like that query is usually a strict equality one, except when the query string has colons in it, when it's made into a "like" query (see d83fc0ec16c4868b92915eda1b8af0ee2145ea0f ). If I'm understanding all of this correctly, when that's a strict equality query, the index will help; when it's a "like" query, it probably won't (I think only a full-text index would help there). We don't seem to be doing any queries that use the range capabilities of a btree index, AFAICT, so in theory we could use a hash index, except that hash indexes can't be multicolumn, only B-tree, GiST, GIN, and BRIN indexes.

I did find this discussion of a similar problem: https://www.spinics.net/lists/pgsql/msg212499.html
which suggests creating a btree index in such a way that a hash of the concatenation of the items in each row is used. This would almost certainly solve the size issue, but I don't know if such an index would actually be any use for speeding up the query - would postgresql know it could use such an index to handle our queries? I don't know, but it seems somehow unlikely.

Poking through the postgresql docs, another option I guess we could use is a partial index: https://www.postgresql.org/docs/current/indexes-partial.html . We could make the index only contain settings actually likely to be used in these queries. There is one path where, technically, any arbitrary setting at all might be in the query; that's cancelling a job by settings via the API. API/V1/Job.pm cancel(), if the query doesn't contain a job id, takes all parameters from the query and treats them as settings for a call to cancel_by_settings (which uses query_for_settings) - so you can cancel all jobs with any arbitrary setting(s) you specify. But it feels to me like this probably isn't used a lot. The other two paths that use query_for_settings both specifically limit the settings they will actually query for. _prepare_complex_query_search_args (in ResultSet/Jobs.pm) will only query for qw(ISO HDD_1 WORKER_CLASS). destroy (in API/V1/Iso.pm) will only query for ISO. So we could actually use a partial index that only indexes entries whose "key" is ISO, HDD_1 or WORKER_CLASS, and that should cover most queries. I think. We could add other "standard" settings to the list for the purposes of cancel_by_settings, I guess.

Going the opposite way we could use a partial index that indexes all entries except those whose "key" is ADVISORY_NVRS , I guess. Which would be an extremely specific exemption to carry upstream, but I guess would do the trick...

Actions

Also available in: Atom PDF