action #109864

Updated by livdywan over 2 years ago

## Observation 

 See the original ticket 

 ## Acceptance criteria 
 * **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented 
 * **AC2:** Improvements are planned 

 ## Suggestions 
 * Bring up in retro 
 * Conduct "Five-Whys" analysis for the topic 
 * Identify follow-up tasks in tickets 
 * Organize a call to conduct the 5 whys (not as part of the retro) 

 ## Five Whys 

 1. Why did all jobs fail unexpectedly? Why...? 
  * Database contained more data than the design anticipated 
   => Research how we should anticipate exceeding ID ranges, e.g. re-use IDs within openQA before hitting ID limits. This would have the additional advantage that user-readable IDs don't get too big, but the oldest OSD job is https://openqa.suse.de/tests/385557, so there is no non-fragmented free range; likely not feasible for jobs, maybe for other tables. Likely not useful overall: just use bigint everywhere it matters -> covered in #109846 already ... 
 2. Why didn't we see it coming? Why...? 
  * We don't have alerts based on the number of IDs or rows 
  * There is no log message for this 
   => Add alerts for any IDs nearing a limit: Research industry-standard best practices for PostgreSQL admins ... 
 3. Why isn't everything using bigint? Why...? 
  * We would rather not waste space by changing all data types 
   => We should research how much additional space and performance cost bigint would really incur; it may be negligible ... 
 4. Why did we need to apply manual repair? Why...? 
  * Someone had to log in via SSH because a PR wouldn't assure us that the problem was fixed 
  * We don't have a test deployment 
  * ~~Deployment itself takes a long time~~ 
  * We could not foresee how long a migration would take without having our production database as a realistic example ... 
 5. Why did it take 6 hours to fix? Why...? 
  * Time passed between the report and acute action 
  * The team isn't currently spread across multiple timezones. 
   => This was communicated when the team composition changed and we should remind people about that ...
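
 ## Sketch: ID headroom check 

 The suggestions under whys 2 and 3 (alert before IDs hit a limit, estimate what bigint would cost) can be illustrated with a small calculation. This is only a sketch, not how openQA or its monitoring does it: the function names and the 80% threshold are hypothetical, and how the current maximum ID is obtained from the database is left out. The only facts relied on are that a PostgreSQL `integer` is 4 bytes with an upper bound of 2147483647 and a `bigint` is 8 bytes.

```python
# Upper bound of PostgreSQL's 4-byte "integer" type.
INT4_MAX = 2**31 - 1


def id_usage(current_max_id: int, limit: int = INT4_MAX) -> float:
    """Return the fraction of the ID range that is already used."""
    if current_max_id < 0:
        raise ValueError("current_max_id must not be negative")
    return current_max_id / limit


def should_alert(current_max_id: int, threshold: float = 0.8,
                 limit: int = INT4_MAX) -> bool:
    """Return True once usage reaches the threshold.

    The default of 0.8 is an assumed value for illustration only.
    """
    return id_usage(current_max_id, limit) >= threshold


def bigint_extra_bytes(rows: int, columns: int = 1) -> int:
    """Estimate the extra raw column data when moving integer -> bigint.

    Each converted column grows from 4 to 8 bytes per row. Alignment
    padding and index growth are ignored, so this is a rough estimate
    and not a measurement.
    """
    return rows * columns * (8 - 4)


if __name__ == "__main__":
    # Example values chosen for illustration, not taken from production.
    print(should_alert(2_000_000_000))
    print(bigint_extra_bytes(rows=1_000_000, columns=2))
```

 Such a check would only turn into an alert once it is fed with real values, and the actual space and performance cost of bigint would still need to be measured on a realistic copy of the database, as noted under why 4.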
