action #175518

Updated by livdywan about 1 month ago

## Acceptance criteria 
 * **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented 
 * **AC2:** Improvements are planned 

 ## Suggestions 
 * Organize a call to conduct the 5 whys (not as part of the retro) 
 * Conduct "Five-Whys" analysis for the topic 
 * Identify follow-up tasks in tickets 


 ## What happened? 
 * As reported in #175464, there were many failures across OSD and o3 with `setup failure: isotovideo can not be started`. This affected all or almost all workers, though not at exactly the same time because the deployment is not instantaneous. We eventually found that a broken package had been deployed. 

 ## Five Whys 
 1. Why did we not realize this before the deployments? 
     * This was a patched version of the perl-Mojo-IOLoop-ReadWriteProcess package in devel:openQA. 
     * This was used by openQA-in-openQA. 
     * The error was visible in openQA-in-openQA. 
     * o3 deployments do not require openQA-in-openQA tests to pass. 
     * The OSD deployment was done because of #175485, which has since been fixed. 
 2. Why was the patch not tested outside devel:openQA first? 
     * We basically only created the package branch because we did not have a CPAN release yet, and we did not think such a change could be so problematic. 
     * Partially also because we did not trust the CI checks in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/. There actually were CI checks, and mkittler did create a package branch, see https://progress.opensuse.org/issues/170209#note-30 referring to https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess, but that was *after* the incident. 
     * https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Tumbleweed/x86_64 
     * Manual validation on openqaworker14.qa.suse.cz was reported to be successful. 
     * Was the patch actually installed and effectively used? 
 3. Why did the CI checks fail in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61 ? 
     * Unrelated pre-existing failures stopped people from looking at the test results properly. 
 4. Why did package checks in OBS not prevent the package from being published? 
     * Because [tests](https://build.opensuse.org/package/show/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess) were passing. Maybe critical tests are not included? 
 5. Why was it so much effort to get back to a working state? 
     * The ticket was resolved within a day. 
     * A team was assembled within a couple of hours of the original report. 
     * Both o3 and OSD were impacted, so more than just one infrastructure. OSD was additionally impacted by #175485. 
     * The package was deployed as part of devel:openQA, and we do not have a package cache with the old package version on all systems, unlike maintenance updates where older versions remain available in official repositories. 

 ## What worked well 
 1. **W1:** We formed a task force quickly and collaborated well, relying on our processes for such situations. 
 2. **W2:** After finding out which package update caused the problems and understanding where the previous package version came from, we could easily revert using `zypper`. 
 3. **W3:** We were able to quickly link the failures to the actual development change causing the regression. 
 4. **W4:** Even though openQA-in-openQA tests could not be run properly because o3 was already affected, openQA-in-openQA still did not succeed as long as the regression persisted. 
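The revert mentioned in W2 can be sketched with standard `zypper` commands. The package name is the one from this incident, but the version number is a placeholder, so treat this as a minimal sketch rather than the exact commands used:

```shell
# Show which repository and version the currently installed package came from
zypper info perl-Mojo-IOLoop-ReadWriteProcess

# List all versions available across configured repositories
zypper search --details perl-Mojo-IOLoop-ReadWriteProcess

# Downgrade to a known-good version (the version string here is a placeholder)
zypper install --oldpackage perl-Mojo-IOLoop-ReadWriteProcess=0.34
```

This only works as long as a repository still offers the older version, which is exactly the gap described in the last answer of the Five Whys above.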

 ## Ideas 
 1. **I1:** Make o3 deployments dependent on openQA-in-openQA tests passing. 
   * Re-consider staging instances? 
       * On-demand staging instances are quite similar to personal instances in practice, so past experience suggests little benefit. 
   * For more context also see the original "continuous delivery" ticket #18006 "continuous testing + delivery of tested openQA on openSUSE" 
 2. **I2:** Preserve package cache on worker hosts. 
     * Preserving history of all repos and all packages used on production machines 
 3. **I3:** Ensure https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/ has 100% passing unit tests in CI 
 4. **I4:** Explore tooling for checking of deployed versions and bisecting of problematic packages 
     * git across multiple repos can tell us what is being used 
     * salt can tell us what's installed on osd machines 
         * Enhance the existing diff logic used for deployments and use it e.g. with investigations 
     * How can deployed packages on o3 be verified? 
     * Consider using snapshots on all machines 
     * https://wiki.archlinux.org/title/Etckeeper 
     * Use containers to ensure we know exactly what is deployed everywhere?
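For I2, `zypper` can already be told to keep downloaded RPMs, which would preserve old package versions locally for reverts under `/var/cache/zypp/packages/`. A minimal sketch; the repository alias `devel_openQA` is an assumption and may differ on the actual hosts:

```shell
# Enable keeping downloaded packages for one repository
# (RPMs are then retained under /var/cache/zypp/packages/<alias>/)
zypper modifyrepo --keep-packages devel_openQA

# Or enable it for all configured repositories
zypper modifyrepo --keep-packages --all
```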
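For I4, checking deployed versions is already possible with standard tooling; for example, salt's `pkg` execution module can report installed versions across all OSD minions, and `rpm` can answer the same question on a single host such as an o3 worker. The package name is the one from this incident:

```shell
# Ask every salt minion for the installed version of the package
salt '*' pkg.version perl-Mojo-IOLoop-ReadWriteProcess

# On a single host, query the RPM database directly,
# including when the package was installed
rpm -q --queryformat '%{NAME}-%{VERSION}-%{RELEASE} installed %{INSTALLTIME:date}\n' \
    perl-Mojo-IOLoop-ReadWriteProcess
```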
