Project

General

Profile

Actions

coordination #109846

closed

[epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

Added by okurz over 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-04-12
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Motivation

See #109836 where we found all jobs on OSD are incomplete with the reason 'Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: integer out of range [for Statement " INSERT INTO job_modules ( job_id, name, category, script, milestone, important, fatal, always_rollback, t_created, t_u…' on 2022-04-12 morning due to hitting the integer limit of the job module id. This was mitigated for now on OSD with manual database alterations but needs to be accomodated in our table creations including migrations of existing instances.

Acceptance criteria

  • AC1: New instances of openQA can handle multi-billion job module entries
  • AC2: Existing instances of openQA are migrated/upgraded according to AC1
  • AC3: No severe performance regression
  • AC4: The migration of existing instances finishes in a reasonable time (downtime of some minutes is ok)

Ideas

What okurz did in #109836 manually effectively is in SQL:

ALTER TABLE job_modules ALTER COLUMN id TYPE bigint;
ALTER TABLE needles ALTER COLUMN last_seen_module_id TYPE bigint;
ALTER TABLE needles ALTER COLUMN last_matched_module_id TYPE bigint;

and then restart services:

systemctl restart openqa-webui openqa-scheduler

coolo: "We should do the same in the package and then update the schema version manually on osd database. After that we can check what other tables deserve the same treatment and create one more schema version"

  • From #109864: Add alerts for any IDs nearing a limit: Research the industry standard for postgreSQL admins best practices -> #110136
  • From #109864: We should research how much additional space and performance cost bigint would really mean, maybe negligible -> #110136
  • From #109864: Recovery took multiple hours because the team isn't currently spread across multiple timezones: This was communicated when the team composition changed and we should remind about that -> Done with a chat notice in https://suse.slack.com/archives/C02CANHLANP/p1650450978321009

Some references that might help:


Subtasks 16 (0 open16 closed)

action #109849: Migrate job_modules table to bigintResolvedmkittler2022-04-12

Actions
action #109864: Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:MResolvedlivdywan2022-04-12

Actions
action #110136: [research][timeboxed:10h] Add alerts for any database IDs nearing a limit: Research the industry standard for postgreSQL software development and admins best practices size:SResolvedkraih2022-04-20

Actions
action #112265: Just use bigint in all our database tables (for auto-incremented ID-columns) size:MResolvedmkittler2022-06-10

Actions
openQA Infrastructure (public) - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekendResolvedokurz2022-06-22

Actions
action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend"Resolvedokurz2022-06-22

Actions
openQA Infrastructure (public) - action #112733: Webui Summary dashboard in Grafana is missing I/O panels size:MResolvednicksinger2022-06-20

Actions
action #112736: Better alert based on 2022-06-18 incident size:MResolvedmkittler2022-06-20

Actions
openQA Infrastructure (public) - action #112769: errors from telegraf on osd webUI "[inputs.procstat] Error in plugin: open /sys/fs/cgroup/systemd/system.slice/salt-master.service/cgroup.procs: no such file or directory", same for openqa-scheduler.serviceRejectedokurz2022-06-21

Actions
openQA Infrastructure (public) - action #112916: postgresql.conf is invalid after recent salt changes size:MResolvedokurz2022-06-22

Actions
openQA Infrastructure (public) - action #113671: [timeboxed][10h] Configure write of I/O panels to be on the negative Y-axis again once we're on grafana 8.4 size:SResolvedmkittler2022-06-20

Actions
openQA Infrastructure (public) - coordination #113674: [epic] Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:MResolvedokurz2023-01-09

Actions
openQA Infrastructure (public) - action #122842: Configure I/O alerts again for the webui after migrating to the "unified alerting" in grafana size:MResolvedokurz2023-01-09

Actions
openQA Infrastructure (public) - action #122845: Migrate our Grafana setup to "unified alerting"Resolvednicksinger2023-01-09

Actions
openQA Infrastructure (public) - action #122848: Configure grouped alerts in Grafana correctly size:MResolvedokurz2023-01-09

Actions
openQA Infrastructure (public) - action #125642: Manage "unified alerting" via salt size:MResolvedmkittler2023-01-09

Actions

Related issues 2 (1 open1 closed)

Copied from openQA Project (public) - action #109836: All Jobs on OSD are incomplete since 2022-04-12Resolvedokurz2022-04-12

Actions
Copied to openQA Project (public) - coordination #128366: [epic] further improvement after we did ensure all our database tables accomodate enough dataNew2022-06-22

Actions
Actions #1

Updated by okurz over 2 years ago

  • Copied from action #109836: All Jobs on OSD are incomplete since 2022-04-12 added
Actions #2

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #3

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #4

Updated by okurz over 2 years ago

  • Status changed from New to Blocked
Actions #5

Updated by okurz over 2 years ago

  • Description updated (diff)

Added ideas from five-whys

Actions #6

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #7

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #8

Updated by tinita over 2 years ago

  • Subject changed from [epic] Ensure all our database tables accomodate enough data, e.g. bigint for id's to [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids
Actions #9

Updated by okurz over 1 year ago

  • Copied to coordination #128366: [epic] further improvement after we did ensure all our database tables accomodate enough data added
Actions #10

Updated by okurz over 1 year ago

  • Status changed from Blocked to Resolved

All subtasks resolved, all ACs covered, we are good to resolve!

Actions

Also available in: Atom PDF