action #109864 (closed)

coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids

Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Organisational
Target version:
Start date: 2022-04-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

See the original ticket

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why did all jobs fail unexpectedly?
    • The database contained more data than the design anticipated => Research how to anticipate exceeding ID ranges, e.g. re-using IDs within openQA before hitting the limit (with the additional advantage that user-visible IDs stay small). However, the oldest OSD job is https://openqa.suse.de/tests/385557, so there is no unfragmented free range; re-use is likely not feasible for jobs, though maybe for other tables. Most likely not useful at all: just use bigint everywhere it matters, which is already covered in #109846.
  2. Why didn't we see it coming?
    • We don't have alerts based on the number of IDs or rows
    • There is no log message for this => Add alerts for any IDs nearing their limit; research established best practices among PostgreSQL administrators (a sketch of such a check follows after this list)
  3. Why isn't everything using bigint?
    • We would rather not waste space by changing all data types => We should research how much additional space and performance bigint would really cost; it may be negligible (an integer column takes 4 bytes and a bigint 8 bytes, so the raw overhead is 4 extra bytes per row per affected column, before alignment padding)
  4. Why did we need to apply manual repair?
    • Someone had to log in via SSH because a PR alone would not have assured us that the problem was fixed
    • We don't have a test deployment
    • Deployment itself takes a long time
    • We could not foresee how long a migration would take without our production database as a realistic example (a sketch for measuring this on a staging copy follows after this list)
  5. Why did it take 6 hours to fix?
    • Time passed between the initial report and acute action
    • The team is not currently spread across multiple timezones => This was communicated when the team composition changed and we should remind people of it
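
A minimal monitoring sketch for the alerting idea from point 2, assuming a PostgreSQL 10+ server (which provides the pg_sequences view) and the psycopg2 driver; the DSN and the 80% threshold are placeholders, not anything taken from openQA:

    # Sketch: warn when any PostgreSQL sequence approaches the 32-bit integer
    # limit. The DSN and the 80% threshold are illustrative assumptions.
    import psycopg2

    INT_MAX = 2**31 - 1  # upper bound of PostgreSQL's 4-byte "integer" type

    QUERY = """
        SELECT schemaname, sequencename, last_value
        FROM pg_sequences
        WHERE last_value IS NOT NULL;
    """

    def check_sequences(dsn="dbname=openqa"):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(QUERY)
            for schema, name, last_value in cur.fetchall():
                usage = last_value / INT_MAX
                if usage > 0.8:  # alert threshold chosen arbitrarily here
                    print(f"WARNING: {schema}.{name} at {usage:.1%} of the integer range")

    if __name__ == "__main__":
        check_sequences()

A full check would additionally confirm that the column owning each sequence is still of type integer, since sequences attached to bigint columns are not at risk.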
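For point 4, one way to estimate migration duration in advance is to time the type change against a copy of the production table on a staging host. This is only a sketch: the DSN, table and column names are assumptions, and the measured time only transfers to production if the staging hardware and data size are comparable:

    # Sketch: measure how long widening a column to bigint takes on a staging
    # copy of the data. DSN, table and column names are illustrative only.
    import time
    import psycopg2

    def time_bigint_migration(dsn="dbname=openqa_staging", table="jobs", column="id"):
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                start = time.monotonic()
                # ALTER ... TYPE bigint rewrites the table under an exclusive lock,
                # which is exactly the cost we want to know before touching production.
                cur.execute(f'ALTER TABLE {table} ALTER COLUMN {column} TYPE bigint')
                elapsed = time.monotonic() - start
                print(f"{table}.{column}: rewrite took {elapsed:.1f}s")
            conn.rollback()  # discard the change so the staging copy stays reusable
        finally:
            conn.close()

    if __name__ == "__main__":
        time_bigint_migration()

Rolling back after measuring keeps the staging table unchanged, so the measurement can be repeated or run against several tables.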