coordination #109846

[epic] Ensure all our database tables accommodate enough data, e.g. bigint for IDs

Added by okurz 3 months ago. Updated 3 days ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-04-12
Due date:
2022-07-07
% Done:

33%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Motivation

See #109836, where we found on the morning of 2022-04-12 that all jobs on OSD were incomplete with the reason 'Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: integer out of range [for Statement " INSERT INTO job_modules ( job_id, name, category, script, milestone, important, fatal, always_rollback, t_created, t_u…' because the integer limit of the job module id had been reached. This was mitigated for now on OSD with manual database alterations, but it needs to be accommodated in our table creation code, including migrations of existing instances.

Acceptance criteria

  • AC1: New instances of openQA can handle multi-billion job module entries
  • AC2: Existing instances of openQA are migrated/upgraded according to AC1
  • AC3: No severe performance regression
  • AC4: The migration of existing instances finishes in a reasonable time (a downtime of a few minutes is ok)

Ideas

What okurz did manually in #109836 is effectively the following SQL:

ALTER TABLE job_modules ALTER COLUMN id TYPE bigint;
ALTER TABLE needles ALTER COLUMN last_seen_module_id TYPE bigint;
ALTER TABLE needles ALTER COLUMN last_matched_module_id TYPE bigint;

and then restart services:

systemctl restart openqa-webui openqa-scheduler
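To verify that the manual change took effect, one could check the column types via information_schema; a minimal sketch covering exactly the columns altered above:

-- Confirm the manually altered columns are now bigint
SELECT table_name, column_name, data_type
  FROM information_schema.columns
 WHERE (table_name, column_name) IN
       (('job_modules', 'id'),
        ('needles', 'last_seen_module_id'),
        ('needles', 'last_matched_module_id'));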

coolo: "We should do the same in the package and then update the schema version manually on osd database. After that we can check what other tables deserve the same treatment and create one more schema version"

  • From #109864: Add alerts for any IDs nearing a limit: research industry-standard best practices for PostgreSQL admins -> #110136 (see the query sketch below)
  • From #109864: We should research how much additional space and performance cost bigint would really incur; it may well be negligible -> #110136
  • From #109864: Recovery took multiple hours because the team isn't currently spread across multiple timezones: this was communicated when the team composition changed and we should remind people of it -> Done with a chat notice in https://suse.slack.com/archives/C02CANHLANP/p1650450978321009
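One possible starting point for such an alert, purely as a sketch: assuming PostgreSQL 10+ (where the pg_sequences view exists) and that the ID columns are backed by sequences, the following shows how much of the signed 32-bit range (max 2147483647) each sequence has already consumed:

-- Sketch: percentage of the 32-bit ID space consumed per sequence
SELECT schemaname, sequencename, last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_of_int_max
  FROM pg_sequences
 WHERE last_value IS NOT NULL
 ORDER BY pct_of_int_max DESC;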

Some references that might help:


Subtasks

action #109849: Migrate job_modules table to bigint (Resolved, mkittler)

action #109864: Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M (Resolved, cdywan)

action #110136: [research][timeboxed:10h] Add alerts for any database IDs nearing a limit: Research the industry standard for postgreSQL software development and admins best practices size:S (Resolved, kraih)

action #112265: Just use bigint in all our database tables (for auto-incremented ID-columns) (Feedback, mkittler)

openQA Infrastructure - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend (Resolved, okurz)

action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (Resolved, okurz)

openQA Infrastructure - action #112733: Webui Summary dashboard is missing IO panels (New)

action #112736: Better alert based on 2022-06-18 incident (New)

openQA Infrastructure - action #112769: errors from telegraf on osd webUI "[inputs.procstat] Error in plugin: open /sys/fs/cgroup/systemd/system.slice/salt-master.service/cgroup.procs: no such file or directory", same for openqa-scheduler.service (New)

openQA Infrastructure - action #112787: Use different users for our database accesses so that we can debug where load comes from (New)

openQA Infrastructure - action #112850: Make pgstatstatements persistent on OSD (in salt) (New)

openQA Infrastructure - action #112916: postgresql.conf is invalid after recent salt changes size:M (Feedback, okurz)

coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (New)

action #112874: Jobs stuck in assigned, worker reports to be "currently stopping" for > 21h (New)

openQA Infrastructure - action #112943: Consider running explicitly triggered `vacuum ANALYZE` after every database migration (New)

action #112946: Extend openQA documentation with best practices what to do after migration, e.g. look at pg_stats (New)

action #112949: Exercise to switch off non-critical features and conduct emergency drills (New)


Related issues

Copied from openQA Project - action #109836: All Jobs on OSD are incomplete since 2022-04-12 (Resolved, 2022-04-12)

History

#1 Updated by okurz 3 months ago

  • Copied from action #109836: All Jobs on OSD are incomplete since 2022-04-12 added

#2 Updated by okurz 3 months ago

  • Description updated (diff)

#3 Updated by okurz 3 months ago

  • Description updated (diff)

#4 Updated by okurz 3 months ago

  • Status changed from New to Blocked

#5 Updated by okurz 2 months ago

  • Description updated (diff)

Added ideas from five-whys

#6 Updated by okurz 2 months ago

  • Description updated (diff)

#7 Updated by okurz 2 months ago

  • Description updated (diff)
