action #109864

closed

coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids

Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Organisational
Target version:
Start date:
2022-04-12
Due date:
% Done:

0%

Estimated time:

Description

Observation

See the original ticket

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why did all jobs fail unexpectedly?
    • The database contained more data than the design anticipated => Research how we should anticipate exceeding ID ranges, e.g. by re-using IDs within openQA before hitting ID limits. This would have the additional advantage that user-readable IDs don't get too big, but the oldest OSD job is https://openqa.suse.de/tests/385557, so there is no non-fragmented free range and this is likely not feasible for jobs (maybe for other tables). Likely not useful; just use bigint everywhere it matters -> covered in #109846 already
  2. Why didn't we see it coming?
    • We don't have alerts based on the number of IDs or rows
    • There is no log message for this => Add alerts for any IDs nearing a limit; research industry-standard best practices for PostgreSQL admins
  3. Why isn't everything using bigint?
    • We would rather not waste space by changing all data types => We should research how much additional space and performance cost bigint would really incur; it may be negligible
  4. Why did we need to apply manual repair?
    • Someone had to log in via SSH because a PR alone wouldn't assure us the problem was fixed
    • We don't have a test deployment
    • Deployment itself takes a long time
    • We could not foresee how long a migration would take without having our production database as a realistic example
  5. Why did it take 6 hours to fix?
    • Time passed between the report and acute action
    • The team isn't currently spread across multiple timezones => This was communicated when the team composition changed and we should remind people about that
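
The alerting idea from point 2 and the space question from point 3 can be sketched roughly as follows. This is only an illustration, not what openQA or its monitoring actually does: the function names and the 80% threshold are assumptions, and the current maximum ID is assumed to come from a query such as `SELECT max(id)` on the table in question. The limits themselves are the documented PostgreSQL ranges for `integer` (4 bytes) and `bigint` (8 bytes).

```python
# Upper bounds of the PostgreSQL integer types.
INT4_MAX = 2**31 - 1  # "integer" / "serial", 4 bytes
INT8_MAX = 2**63 - 1  # "bigint" / "bigserial", 8 bytes


def id_usage(current_max_id: int, type_max: int = INT4_MAX) -> float:
    """Return the fraction of the ID range already consumed (0.0 to 1.0)."""
    if current_max_id < 0:
        raise ValueError("current_max_id must not be negative")
    return current_max_id / type_max


def should_alert(current_max_id: int, type_max: int = INT4_MAX,
                 threshold: float = 0.8) -> bool:
    """Hypothetical alert rule: fire once usage reaches the threshold."""
    return id_usage(current_max_id, type_max) >= threshold


def extra_bytes_for_bigint(row_count: int, columns: int = 1) -> int:
    """Naive estimate of added table data when widening integer to bigint.

    Counts 4 extra bytes per widened column per row and ignores
    alignment padding and index growth, so real numbers will differ.
    """
    return row_count * columns * 4


if __name__ == "__main__":
    # Example values are made up, not taken from OSD.
    print(should_alert(2_000_000_000))          # close to the integer limit
    print(should_alert(2_000_000_000, INT8_MAX))  # plenty of room as bigint
    print(extra_bytes_for_bigint(100_000_000))
```

An alert based on a fraction of the range rather than an absolute count keeps the same rule usable for both `integer` and `bigint` columns. The byte estimate is only a starting point for the research suggested in point 3.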
Actions #1

Updated by livdywan over 2 years ago

  • Subject changed from Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" to Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by livdywan over 2 years ago

  • Description updated (diff)
  • Status changed from Workable to Resolved

Conducted just now. Apologies for cutting the process short.

Actions #3

Updated by okurz over 2 years ago

  • Assignee set to livdywan

you did good :)

Actions #5

Updated by jstehlik over 2 years ago

Thank you @okurz for spotlighting the importance of covering multiple timezones. I spoke about it with Ralf and we are looking for someone based in China who could reinforce the tools team at least in the error mitigation topic.
