action #175518
## Acceptance criteria

* **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented
* **AC2:** Improvements are planned

## Suggestions

* Organize a call to conduct the 5 whys (not as part of the retro)
* Conduct the "Five-Whys" analysis for the topic
* Identify follow-up tasks in tickets

## What happened?

* As reported in #175464, there were many failures across OSD and o3 with `setup failure: isotovideo can not be started`. This affected all or almost all workers, though not at exactly the same time since the deployment is not instantaneous, and we eventually found that a broken package had been deployed.

## Five Whys

1. Why did we not realize this before the deployments?
   * This was a patched version of the perl-Mojo-IOLoop-ReadWriteProcess package in devel:openQA.
   * This was used by openQA-in-openQA.
   * The error was visible in openQA-in-openQA.
   * o3 deployments do not require openQA-in-openQA tests to pass.
   * The OSD deployment was done due to #175485, which has been fixed in the meantime.
2. Why was the patch not tested outside devel:openQA first?
   * We basically only created the package branch because we did not have a CPAN release yet, and we did not think such a change could be so problematic.
   * Partially also because we did not trust the CI checks in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/
     * Actually there were checks, and mkittler did create a branch, see https://progress.opensuse.org/issues/170209#note-30 referring to https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess, but that was *after* the incident.
     * https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Tumbleweed/x86_64
   * Manual validation on openqaworker14.qa.suse.cz was reported to be successful.
     * Was the patch actually installed and effectively used?
3. Why did the CI checks in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61 fail?
   * Unrelated pre-existing failures stopped people from looking at the test results properly.
4. Why did package checks in OBS not prevent the package from being published?
   * Because the [tests](https://build.opensuse.org/package/show/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess) were passing. Maybe critical tests are not included?
5. Why was it so much effort to get back to a working state?
   * The ticket was resolved within a day.
   * A team was assembled within a couple of hours of the original report.
   * Both o3 and OSD were impacted, so more than just one infrastructure. OSD was additionally impacted by #175485.
   * The package was deployed as part of devel:openQA and we do not have a package cache with the old package version on all systems, unlike maintenance updates where older versions remain available in official repositories.

…

## What worked well

1. **W1:** We formed a task force quickly and collaborated well, relying on our established processes for such situations.
2. **W2:** After finding out which package update caused the problems and where the previous package version came from, we could easily revert using `zypper` (see the sketch after this list).
3. **W3:** We were able to quickly link the regression to the actual development change that caused it.
4. **W4:** Even though openQA-in-openQA tests could not be run properly because o3 was already affected, openQA-in-openQA still did not succeed as long as the regression persisted.
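As an illustration of W2, here is a minimal sketch of how such a revert can look with `zypper`, assuming the previous package version is still available in a configured repository or the local package cache; the version string is a placeholder, not the actual one from the incident:

```sh
# Show the currently installed version and which versions each repository offers
rpm -q perl-Mojo-IOLoop-ReadWriteProcess
zypper search --details perl-Mojo-IOLoop-ReadWriteProcess

# Downgrade to a known-good version; --oldpackage allows installing an older version
zypper install --oldpackage 'perl-Mojo-IOLoop-ReadWriteProcess=<known-good-version>'

# Optionally lock the package so the next deployment does not pull in the broken build again
zypper addlock perl-Mojo-IOLoop-ReadWriteProcess
```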
## Ideas

1. **I1:** Make o3 deployments depend on openQA-in-openQA tests passing.
   * Re-consider staging instances?
   * Are on-demand staging instances quite similar to personal instances in practice? If so, past experience suggests little benefit.
   * For more context also see the original "continuous delivery" ticket #18006 "continuous testing + delivery of tested openQA on openSUSE"
2. **I2:** Preserve the package cache on worker hosts (see the sketch after this list).
   * Preserve the history of all repos and all packages used on production machines
3. **I3:** Ensure https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/ has 100% passing unit tests in CI
4. **I4:** Explore tooling for checking deployed versions and bisecting problematic packages (also sketched below)
   * git across multiple repos can tell us what is being used
   * salt can tell us what is installed on OSD machines
   * Enhance the existing diff logic used for deployments and use it e.g. for investigations
   * How can deployed packages on o3 be verified?
   * Consider using snapshots on all machines
     * https://wiki.archlinux.org/title/Etckeeper
   * Use containers to ensure we know exactly what is deployed everywhere?
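Related to I2 and I4, a minimal sketch of what preserving the package cache and checking deployed versions could look like; the repository alias and host names are placeholders, and it assumes the OSD machines are reachable via salt:

```sh
# I2: keep downloaded RPMs in /var/cache/zypp/packages so an older build can be
# reinstalled even after the repository has moved on ("devel_openQA" is a placeholder alias)
zypper modifyrepo --keep-packages devel_openQA

# I4: query the installed package version on all salt-managed OSD machines
salt '*' pkg.version perl-Mojo-IOLoop-ReadWriteProcess

# I4: the same via plain SSH for hosts that are not salt-managed, e.g. o3 workers
# (the host list is a placeholder)
for host in host1 host2; do
    ssh "$host" rpm -q perl-Mojo-IOLoop-ReadWriteProcess
done
```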