action #109864
coordination #109846 (closed): [epic] Ensure all our database tables accommodate enough data, e.g. bigint for ids
Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M
Description
Observation
See the original ticket
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Five Whys
- Why did all jobs fail unexpectedly?
- The database contained more data than the design anticipated => Research how we should anticipate exceeding ID ranges, e.g. re-use IDs within openQA before hitting ID limits. This would have the additional advantage that user-readable IDs don't get too big, but the oldest OSD job is https://openqa.suse.de/tests/385557, so we don't have a non-fragmented free range; likely not feasible for jobs, maybe for other tables. Likely not useful overall: just use bigint everywhere it matters -> covered in #109846 already
- Why didn't we see it coming?
- We don't have alerts based on the number of IDs or rows
- There is no log message for this => Add alerts for any IDs nearing a limit: research industry-standard best practices for PostgreSQL admins
- Why isn't everything using bigint?
- We would rather not waste space by changing all data types => We should research how much additional space and performance cost bigint would really incur; it may be negligible
- Why did we need to apply manual repair?
- Someone had to log in via SSH because a PR wouldn't assure us the problem was fixed
- We don't have a test deployment
- Deployment itself takes a long time
- We could not foresee how long a migration would take without having our production database as a realistic example
- Why did it take 6 hours to fix?
- Time passed between the report and acute action
- The team isn't currently spread across multiple timezones. => This was communicated when the team composition changed, and we should remind people about it
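As a rough illustration of the "alert on IDs nearing a limit" and "how much space would bigint cost" follow-ups above, here is a minimal sketch. It assumes the affected ID columns are PostgreSQL 4-byte `integer` values; the function names and the 80% threshold are hypothetical and not part of openQA or of any existing monitoring setup. How the current maximum ID is obtained (e.g. from a sequence or a `max(id)` query) is left out on purpose.

```python
INT4_MAX = 2**31 - 1  # upper bound of PostgreSQL's 4-byte signed "integer"
INT8_MAX = 2**63 - 1  # upper bound of PostgreSQL's 8-byte signed "bigint"


def id_usage(current_max_id: int, limit: int = INT4_MAX) -> float:
    """Return the fraction of the ID range that is already consumed."""
    if current_max_id < 0:
        raise ValueError("current_max_id must be non-negative")
    return current_max_id / limit


def should_alert(current_max_id: int, threshold: float = 0.8,
                 limit: int = INT4_MAX) -> bool:
    """True once usage reaches the (hypothetical) alert threshold."""
    return id_usage(current_max_id, limit) >= threshold


def extra_bytes_for_bigint(row_count: int, id_columns: int = 1) -> int:
    """Rough lower bound on raw value growth when moving from integer to bigint.

    Counts only the 4 extra bytes per stored value; alignment padding and
    index growth are ignored, so real numbers will differ.
    """
    return row_count * id_columns * 4
```

For example, `should_alert(2_000_000_000)` is true because roughly 93% of the `integer` range is used, while a table of one million rows with a single ID column would grow by about 4 MB of raw values under this estimate.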
Updated by livdywan over 2 years ago
- Subject changed from Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" to Conduct Five Whys for "All Jobs on OSD are incomplete since 2022-04-12" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan over 2 years ago
- Description updated (diff)
- Status changed from Workable to Resolved
Conducted just now. Apologies for cutting the process short.
Updated by jstehlik over 2 years ago
Thank you @okurz for spotlighting the importance of covering multiple timezones. I spoke about it with Ralf and we are looking for someone based in China who could reinforce the tools team at least in the error mitigation topic.