action #109864


coordination #109846: [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids

Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M

Added by okurz about 2 years ago. Updated about 2 years ago.

See the original ticket

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned


  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why did all jobs fail unexpectedly?
    • Database contained more data than the design anticipated => Research how we should anticipate exceeding ID ranges, e.g. re-use IDs within openQA before hitting ID limits. This would have the additional advantage that user-readable IDs don't get too big, but because of the oldest job still on OSD we don't have a non-fragmented free range, so this is likely not feasible for jobs, maybe for other tables. Likely not useful; just use bigint everywhere it matters -> covered in #109846 already
  2. Why didn't we see it coming?
    • We don't have alerts based on the number of IDs or rows
    • There is no log message for this => Add alerts for any IDs nearing a limit: Research industry-standard best practices for PostgreSQL admins
  3. Why isn't everything using bigint?
    • We would rather not waste space by changing all data types => We should research how much additional space and performance cost bigint would really mean, maybe it is negligible
  4. Why did we need to apply manual repair?
    • Someone had to log in via SSH because a PR alone wouldn't assure us the problem was fixed
    • We don't have a test deployment
    • Deployment itself takes a long time
    • We could not foresee how long a migration would take without having our production database as a realistic example
  5. Why did it take 6 hours to fix?
    • Time passed between the report and acute action
    • The team isn't currently spread across multiple timezones. => This was communicated when the team composition changed and we should remind people of that
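
The follow-up from the second why, "Add alerts for any IDs nearing a limit", could be sketched roughly as below. This is only an illustration under assumptions, not what openQA or our monitoring actually does: the `pg_sequences` view is assumed to be available (PostgreSQL 10 or later), the 80% threshold is arbitrary, and the names `SequenceUsage`, `sequences_near_limit` and the sample sequence names are made up for this example. Note that a sequence's own maximum can be larger than what the column it feeds can hold, so the sketch compares against the 32-bit integer limit explicitly.

```python
from dataclasses import dataclass

# Upper bound of a PostgreSQL "integer" (int4) column.
INT4_MAX = 2**31 - 1

# Query that could supply the inputs on PostgreSQL 10+ (assumption: the
# monitoring user may read pg_sequences). It is not executed in this sketch.
SEQUENCE_QUERY = """
SELECT schemaname, sequencename, last_value
FROM pg_sequences
WHERE last_value IS NOT NULL
"""


@dataclass
class SequenceUsage:
    """Last handed-out ID of one sequence and the limit of the column it feeds."""

    name: str
    last_value: int
    column_limit: int = INT4_MAX  # assumption: the ID column is a 32-bit integer

    @property
    def used_fraction(self) -> float:
        """Share of the available ID range that is already consumed."""
        return self.last_value / self.column_limit


def sequences_near_limit(usages, threshold=0.8):
    """Return the sequences whose consumed share is at or above the threshold."""
    return [u for u in usages if u.used_fraction >= threshold]
```

An alerting job would run the query periodically, build one `SequenceUsage` per row, and raise an alert if `sequences_near_limit(...)` returns anything. Which threshold gives enough lead time depends on how fast IDs are consumed, which is part of the research item above.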
Actions #1

Updated by livdywan about 2 years ago

  • Subject changed from Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" to Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by livdywan about 2 years ago

  • Description updated (diff)
  • Status changed from Workable to Resolved

Conducted just now. Apologies for cutting the process short.

Actions #3

Updated by okurz about 2 years ago

  • Assignee set to livdywan

you did good :)

Actions #5

Updated by jstehlik about 2 years ago

Thank you @okurz for spotlighting the importance of covering multiple timezones. I spoke about it with Ralf and we are looking for someone based in China who could reinforce the tools team at least in the error mitigation topic.

