action #175518
coordination #102906 (closed): [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"
Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" size:S
Start date:
2025-01-24
Due date:
% Done:
0%
Estimated time:
Tags:
Description
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Organize a call to conduct the 5 whys (not as part of the retro)
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
What happened?
- As reported in #175464, a lot of jobs across osd and o3 failed with "setup failure: isotovideo can not be started". This affected all or almost all workers, though not at exactly the same time because the deployment is not instantaneous. We eventually found that a broken package had been deployed.
Five Whys
- Why did we not realize this before the deployments?
- This was a patched version of the perl-Mojo-IOLoop-ReadWriteProcess package in devel:openQA.
- This was used by openQA-in-openQA.
- The error was visible in openQA-in-openQA.
- o3 deployments do not require openQA-in-openQA tests to pass.
- OSD deployment was done due to #175485, fixed meanwhile.
- Why was the patch not tested outside devel:openQA first?
- We basically only created the package branch because we didn't have a CPAN release yet and we didn't think such a change could be so problematic.
- Partially also because we did not trust the CI checks in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/. mkittler did create a branch, but only after the incident, see https://progress.opensuse.org/issues/170209#note-30 referring to https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess
- https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Tumbleweed/x86_64
- Manual validation on openqaworker14.qa.suse.cz was reported to be successful
- Was the patch installed and effectively used?
- Why did the CI checks fail in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61 ?
- Unrelated existing failures stopped people looking at test results properly.
- Why did package checks in OBS not prevent the package from being published?
- Because tests were passing. Maybe critical tests are not included?
- Why was it so much effort to get back to a working state?
- The ticket was resolved within a day.
- A team was assembled within a couple of hours of the original report.
- Both o3 and osd were impacted, so more than just one infrastructure. OSD was additionally impacted by #175485.
- The package was deployed as part of devel:openQA, and we don't have a package cache with the old package version on all systems. This is unlike maintenance updates, where older versions remain available in official repositories.
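The first why notes that o3 deployments do not require openQA-in-openQA tests to pass. A minimal sketch of such a gate, assuming the job results have already been fetched as a list of dictionaries with a "result" field; the function name and the set of accepted results are assumptions, not existing deployment code:

```python
from typing import Iterable, Mapping

# Results treated as good enough to deploy; this choice is an assumption.
OK_RESULTS = frozenset({"passed", "softfailed"})


def deployment_allowed(jobs: Iterable[Mapping[str, str]]) -> bool:
    """Return True only if at least one job ran and every job has an accepted result."""
    job_list = list(jobs)
    if not job_list:
        # No results at all (e.g. tests could not be scheduled) must block the deployment.
        return False
    return all(job.get("result") in OK_RESULTS for job in job_list)
```

Treating an empty result list as a failure matters here: during the incident the openQA-in-openQA tests could not run properly at all, and a gate should not read that as success.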
What worked well
- W1: We formed a task force quickly and nicely collaborated relying on our processes for such situations.
- W2: After finding out which package update caused problems and understanding where the previous package version was from, we could easily revert using zypper
- W3: We were able to make a link to actual development causing the regression quickly
- W4: Even though openQA-in-openQA tests couldn't be run properly because o3 was already affected, openQA-in-openQA still did not succeed as long as the regression persisted
Ideas
- I1: Make o3 deployments dependent on openQA-in-openQA tests passing.
- Re-consider staging instances?
- Are on-demand staging instances quite similar to personal instances in practice? If so, past experience suggests little benefit.
- For more context also see the original "continuous delivery" ticket #18006 "continuous testing + delivery of tested openQA on openSUSE"
- I2: Preserve package cache on worker hosts.
- Preserving history of all repos and all packages used on production machines
- I3: Ensure https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/ has 100% passing unit tests in CI
- I4: Explore tooling for checking of deployed versions and bisecting of problematic packages
- git across multiple repos can tell us what is being used
- salt can tell us what's installed on osd machines
- Enhance the existing diff logic used for deployments and use it e.g. with investigations
- How can deployed packages on o3 be verified?
- Consider using snapshots on all machines
- https://wiki.archlinux.org/title/Etckeeper
- Use containers to ensure we know exactly what is deployed everywhere?
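As an illustration of I4, bisecting a problematic package could look roughly like the sketch below. It assumes two package snapshots (name to version) are available, for example from salt or a preserved package cache, that exactly one changed package causes the failure, and that packages removed in the bad snapshot can be ignored. The `is_broken` callback is hypothetical; in practice it would mean installing the trial set on a test worker and running a job. All version strings are made up.

```python
from typing import Callable, Dict, List, Optional

Packages = Dict[str, str]  # package name -> installed version


def changed_packages(good: Packages, bad: Packages) -> List[str]:
    """Names whose version in `bad` differs from `good` (or that are new in `bad`)."""
    return sorted(name for name, version in bad.items() if good.get(name) != version)


def bisect_culprit(
    good: Packages, bad: Packages, is_broken: Callable[[Packages], bool]
) -> Optional[str]:
    """Find the single changed package that makes `is_broken` return True.

    Assumes `good` is working, `bad` is broken and exactly one package is at fault.
    """
    candidates = changed_packages(good, bad)
    if not candidates:
        return None
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        # Start from the known-good set and upgrade only the first half.
        trial = dict(good)
        trial.update({name: bad[name] for name in first_half})
        if is_broken(trial):
            candidates = first_half
        else:
            candidates = candidates[middle:]
    return candidates[0]


# Hypothetical snapshots; the version strings are invented for illustration.
good = {
    "openQA-worker": "5.1",
    "os-autoinst": "5.1",
    "perl-Mojo-IOLoop-ReadWriteProcess": "1.0",
    "perl-Mojolicious": "9.0",
}
bad = dict(good, **{
    "openQA-worker": "5.2",
    "os-autoinst": "5.2",
    "perl-Mojo-IOLoop-ReadWriteProcess": "1.0+patch",
})


def fake_check(packages: Packages) -> bool:
    # Stand-in for a real test run on a worker.
    return packages["perl-Mojo-IOLoop-ReadWriteProcess"] == "1.0+patch"


print(bisect_culprit(good, bad, fake_check))
```

With n changed packages this needs roughly log2(n) trial installations, which is the main appeal over reverting packages one by one.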
Updated by okurz about 1 month ago
- Copied from action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Updated by mkittler about 1 month ago
- Subject changed from Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" to Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" size:S
- Status changed from New to Workable
Updated by livdywan about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Let's discuss the topic this afternoon
Updated by openqa_review about 1 month ago
- Due date set to 2025-02-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 1 month ago
- Status changed from In Progress to Feedback
Updated by livdywan about 1 month ago
- Status changed from Feedback to Resolved
Follow-up tickets for the ideas filed