action #112265
closedcoordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids
Just use bigint in all our database tables (for auto-incremented ID-columns) size:M
Description
Motivation¶
See #110136#note-12
Acceptance criteria¶
- AC1: All our database tables in openQA use bigint (or are reviewed if it makes sense)
Suggestions¶
- Prepare pull requests for individual tables and merge one by one and monitor production
- Fix production fallouts
- Conduct lesson learned meetings for everything that has gone wrong :)
Updated by kraih over 2 years ago
Btw. i don't think you have to care about the sequences at all, they seem to already be 64 bit. So just the int fields themselves need to be upgraded to bigint.
Updated by mkittler over 2 years ago
- Subject changed from Just use bigint in all our database tables to Just use bigint in all our database tables (for auto-incremented ID-columns)
@kraih I suppose. For the one table we've already converted there was no extra effort required in that regard.
Updated by mkittler over 2 years ago
I've just checked myself (out of curiosity) and openqa=> SELECT * FROM information_schema.sequences ORDER BY sequence_name;
shows indeed that all sequences are bigint
with a max. value of 9223372036854775807. That's at least true for OSD and o3. However, my local database has actually still integer sequences - also after running the migration that changes the corresponding IDs:
openqa-local-bigint=> SELECT * FROM information_schema.sequences ORDER BY data_type LIMIT 6;
sequence_catalog | sequence_schema | sequence_name | data_type | numeric_precision | numeric_precision_radix | numeric_scale | start_value | minimum_value | maximum_value | increment | cycle_option
---------------------+-----------------+-----------------------+-----------+-------------------+-------------------------+---------------+-------------+---------------+---------------------+-----------+--------------
openqa-local-bigint | public | minion_workers_id_seq | bigint | 64 | 2 | 0 | 1 | 1 | 9223372036854775807 | 1 | NO
openqa-local-bigint | public | minion_locks_id_seq | bigint | 64 | 2 | 0 | 1 | 1 | 9223372036854775807 | 1 | NO
openqa-local-bigint | public | minion_jobs_id_seq | bigint | 64 | 2 | 0 | 1 | 1 | 9223372036854775807 | 1 | NO
openqa-local-bigint | public | assets_id_seq | integer | 32 | 2 | 0 | 1 | 1 | 2147483647 | 1 | NO
openqa-local-bigint | public | audit_events_id_seq | integer | 32 | 2 | 0 | 1 | 1 | 2147483647 | 1 | NO
openqa-local-bigint | public | gru_tasks_id_seq | integer | 32 | 2 | 0 | 1 | 1 | 2147483647 | 1 | NO
Only the sequences from Minion are bigint
here. I suppose to cover "legacy"¹ setups the migration should handle sequences after all.
¹Not sure what makes my setup special.
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
Updated by kraih over 2 years ago
mkittler wrote:
However, my local database has actually still integer sequences - also after running the migration that changes the corresponding IDs:
Same here actually. My local O3 clone has 8 byte sequences, while the empty dev database mostly still has 4 byte sequences.
Updated by okurz over 2 years ago
https://github.com/os-autoinst/openQA/pull/4701 merged. The deployment on o3 was triggered soon after and completed after a mere 5 minutes, see https://github.com/os-autoinst/openQA/pull/4701#issuecomment-1159404851, triggering a deployment on OSD explicitly so that I can monitor while I am available: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/416342. Otherwise the next deployment would be triggered tomorrow morning, likely also fine.
https://openqa.opensuse.org/tests/2421090#step/start_test/7 just hit the short time window where o3 wasn't available. I labeled and retriggered the job so that the OSD deployment will find all green openQA-in-openQA tests.
On OSD sudo journalctl --since="2022-06-18 11:39" -u openqa-webui
shows
Jun 18 12:39:27 openqa openqa[4691]: [warn] [pid:4691] Job IDs will be converted to bigint. That may take multiple hours on big databases. It is safe to stop the service (and start from >
Jun 18 12:57:31 openqa openqa[4691]: [info] Loading external plugin AMQP
so 18m for migration. Some alerts about minion jobs have been triggered but shortly went back to ok again. I guess such alerts are ok for such exceptional migrations.
Updated by okurz over 2 years ago
- Related to coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend added
Updated by mkittler over 2 years ago
Pending PR: https://github.com/os-autoinst/openQA/pull/4705
Updated by okurz over 2 years ago
- Subject changed from Just use bigint in all our database tables (for auto-incremented ID-columns) to Just use bigint in all our database tables (for auto-incremented ID-columns) size:M
- Description updated (diff)
Updated by okurz over 2 years ago
merged yesterday. o3 was auto-deployed and migrated and it looked fine. Today OSD was auto-deployed and also looks fine. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1657431899773&to=1657433289012&viewPanel=78 shows the migration for http response time. Database rows in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1657429493800&to=1657436565913&viewPanel=89 also show the immediate effect of migration but nothing significant after the migration finished. https://monitor.qa.suse.de/alerting/list?state=not_ok shows no alerts.
EDIT: Reviewed some QA SLE maintenance update tests with retriggering and such and everything works as expected
Updated by mkittler over 2 years ago
Yes, seems like everything works as expected. Just to be sure, I've ran VACUUM ANALYZE
on all affected tables. That shouldn't harm and previously ensured the query planner uses indexes correctly.
Updated by AdamWill about 2 years ago
well, er, this didn't go so well on our staging instance:
Nov 26 02:27:59 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: [warn] Job IDs will be converted to bigint. That may take multiple hours on big databases. It is safe to stop the service (and start from scratch on the next startup).
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: failed to run SQL in /usr/share/openqa/script/../dbicdh/PostgreSQL/upgrade/94-95/001-auto.sql: DBIx::Class::DeploymentHandler::DeployMethod::SQL::Translator::try {...} (): DBI Exception: DBD::Pg::db do failed: ERROR: index row size 2712 exceeds btree version 4 maximum 2704 for index "idx_job_value_settings"
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: DETAIL: Index row references tuple (141486,37) in relation "job_settings".
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: Consider a function index of an MD5 hash of the value, or use full text indexing. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->upgrade_single_step (attribute declared in /usr/share/perl5/vendor_perl/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: (running line 'ALTER TABLE job_settings ALTER COLUMN id TYPE bigint') at /usr/share/perl5/vendor_perl/DBIx/Class/DeploymentHandler/DeployMethod/SQL/Translator.pm line 263.
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa line 0
Nov 26 02:30:36 openqa-lab01.iad2.fedoraproject.org openqa-webui-daemon[10465]: DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa line 0
Updated by AdamWill about 2 years ago
googling for that error suggests this may be tricky to resolve and involve, uh, knowing about databases and stuff. and it's friday night, so for now I've downgraded openQA back to a June snapshot. Will try and figure this out on Monday. help appreciated...
Updated by AdamWill about 2 years ago
I filed https://progress.opensuse.org/issues/121054 for my issue.