action #109864
Updated by livdywan over 2 years ago
## Observation

See the original ticket

## Acceptance criteria

* **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented
* **AC2:** Improvements are planned

## Suggestions

* Bring up in retro
* Conduct "Five-Whys" analysis for the topic
* Identify follow-up tasks in tickets
* Organize a call to conduct the 5 whys (not as part of the retro)

## Five Whys

1. Why did all jobs fail unexpectedly? Why...?
   * The database contained more data than the design anticipated
     => Research how we should anticipate exceeding ID ranges, e.g. re-use IDs within openQA before hitting ID limits. This would have the additional advantage that user-readable IDs don't get too big, but the oldest OSD job is https://openqa.suse.de/tests/385557, so we don't have a non-fragmented free range; likely not feasible for jobs, maybe for other tables. Likely not useful, just use bigint everywhere it matters -> covered in #109846 already
   ...
2. Why didn't we see it coming? Why...?
   * We don't have alerts based on the number of IDs or rows
   * There is no log message for this
     => Add alerts for any IDs nearing a limit: research the industry-standard best practices for PostgreSQL admins
   ...
3. Why isn't everything using bigint? Why...?
   * We would rather not waste space by changing all data types
     => We should research how much additional space and performance cost bigint would really mean; it may be negligible
   ...
4. Why did we need to apply manual repair? Why...?
   * Someone had to log in via SSH because a PR wouldn't assure us the problem was fixed
   * We don't have a test deployment
   * ~~Deployment itself takes a long time~~
   * We could not foresee how long a migration would take without having our production database as a realistic example
   ...
5. Why did it take 6 hours to fix? Why...?
   * Time passed between the report and acute action
   * The team isn't currently spread across multiple timezones.
     => This was communicated when the team composition changed and we should remind people about that
   ...
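
The alerting idea from point 2 and the space question from point 3 could be sketched roughly as below. This is only an illustration under stated assumptions, not what openQA or its monitoring actually does: the helper names (`id_usage`, `check_id_ranges`, `bigint_extra_bytes`), the 80% threshold and the sample numbers are all made up. The only facts relied on are that a PostgreSQL `integer` is 4 bytes with a maximum of 2^31 - 1 and a `bigint` is 8 bytes. In a real check the current maximum IDs would have to come from the database, e.g. via something like `SELECT max(id) FROM <table>`.

```python
"""Sketch: flag 32-bit ID columns nearing their limit (hypothetical helpers)."""

INT4_MAX = 2**31 - 1  # largest value of a PostgreSQL 4-byte "integer"


def id_usage(current_max_id: int, limit: int = INT4_MAX) -> float:
    """Return the fraction of the ID range that is already consumed."""
    if current_max_id < 0:
        raise ValueError("current_max_id must be non-negative")
    return current_max_id / limit


def check_id_ranges(max_ids: dict[str, int], threshold: float = 0.8) -> list[str]:
    """Return one alert message per column whose usage reaches the threshold.

    The 0.8 default is an arbitrary assumption, not a recommended value.
    """
    alerts = []
    for column, current in sorted(max_ids.items()):
        usage = id_usage(current)
        if usage >= threshold:
            alerts.append(f"{column}: {usage:.1%} of the int4 range used")
    return alerts


def bigint_extra_bytes(row_count: int, columns: int = 1) -> int:
    """Rough extra raw column data when going from integer (4 B) to bigint (8 B).

    Ignores alignment padding and index growth, so treat it as a first
    estimate only.
    """
    return row_count * columns * 4


if __name__ == "__main__":
    # Made-up numbers for illustration; real values would be queried
    # from the database.
    sample = {"example_a.id": 2_000_000_000, "example_b.id": 12_345}
    for message in check_id_ranges(sample):
        print(message)
    print(bigint_extra_bytes(row_count=100_000_000), "extra bytes (rough)")
```

A check like this only makes sense for columns that are still 32-bit; once a column has been migrated to bigint as covered in #109846, it could be dropped from the list.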