action #134723
Updated by livdywan over 1 year ago
## Observation

The stricter handling of new-lines in os-autoinst resulted in some non-trivial breakage of tests. Apparently this hadn't been clear to everyone involved and may not have been communicated properly. What follows is a timeline from the change to the deployment of the revert. This should be the basis for conversation and an opportunity to gather feedback from different stakeholders.

### Timeline

- Original PR https://github.com/os-autoinst/os-autoinst/pull/2342
  - Proposed on July 11
  - Written warning in the PR saying `This PR will make some bad code in our tests blow up. That is intentional.` and a *not ready* label
- August 22
  - *not ready* label removed after additional reviews
  - https://suse.slack.com/archives/C02CANHLANP/p1692715267069899
- August 24
  - Related test failures were discovered and tracked down
    - https://suse.slack.com/archives/C02CANHLANP/p1692864203095439
    - https://suse.slack.com/archives/C02CANHLANP/p1692879684876059
  - OBS downtime
  - Revert https://github.com/os-autoinst/os-autoinst/pull/2355
  - Fixes for the tests with new-lines were proposed
- August 25 https://suse.slack.com/archives/C02CANHLANP/p1692952660092829

### Questions to be asked here might be:

1. Why wasn't it clear to everyone that this was going to break some tests?
   - The failure rate before and after was unclear
   - Maybe the PR title and description were a bit short?
     - It was stated that some tests would break
   - Was it clearly coordinated, e.g. in Slack, for those not tracking the PR?
   - Instead of a fatal error there could have been a log message?
     - Maybe nobody would pay attention to such messages?
   - It could have softfailed... although that is not how softfails should ideally be used, since they are about the product
     - How about flagging certain log messages in the investigation tab or at the top of the job?
   - A test setting like NEW_MULTI_LINE_CHECK could have made the check opt-in, so teams could enable it once they are available to look into failures (see the sketch below)
     - We can't necessarily coordinate a good time between all stakeholders?
   - This should have been clearly announced in the release notes
   - Merging the change and reverting it soon after had the effect of revealing all of the related issues
2. Why was the merge not scheduled smarter?
   - A lot of updates were in the queue at the time
   - The move to Prague was still ongoing
3. Why was the revert not deployed immediately?
   - It was not clear to stakeholders that this was a testapi change, e.g. as opposed to a change in tests or hardware setups
   - This was expected to break things, so it shouldn't be rolled back too fast. Otherwise we're not seeing all results. This is why we're doing a bark test here, no?
   - The error message should have linked to the PR for clarity.
   - Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that? (a sketch of this idea follows the Suggestions section)
   - The details of automatic deployment weren't clear to everyone in the team, and we should make sure that they are in the future
   - Maybe we can make it more discoverable whether the rollback was applied. There were some people asking about it.
4. Why aren't the other teams seen as customers?
   - "Customers" could mean having API versions that can be enabled gradually
   - Did we keep in mind external users in the community?
   - We shall not break tests, and if we have to, we shall make it easy for other teams to decide for themselves when to fix them.

We covered the 4 points we care about ;-)
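To make the opt-in idea from question 1 concrete, here is a minimal sketch of how a setting like NEW_MULTI_LINE_CHECK could gate the stricter new-line check. os-autoinst itself is written in Perl; this sketch uses Python purely for illustration, and the setting semantics and error text are assumptions, not the actual implementation:

```python
def check_command(command: str, settings: dict) -> None:
    """Reject commands containing new-lines, but only for jobs that
    opted in via the (proposed, hypothetical) NEW_MULTI_LINE_CHECK
    setting, so teams enable it when they can look into failures."""
    if not settings.get("NEW_MULTI_LINE_CHECK"):
        return  # legacy behaviour: new-lines are tolerated
    if "\n" in command:
        # A unique, searchable message linking to the PR would make
        # affected jobs easy to find (and restart) later.
        raise ValueError(
            "command contains a new-line, see "
            f"https://github.com/os-autoinst/os-autoinst/pull/2342: {command!r}"
        )


check_command("echo ok\n", {})  # not opted in: passes silently
try:
    check_command("echo ok\n", {"NEW_MULTI_LINE_CHECK": "1"})
except ValueError as err:
    print(err)  # the strict check fires only for opted-in jobs
```

An opt-in flag like this would have let each team pick its own migration window instead of all tests breaking at once.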
## Acceptance criteria

* **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented
* **AC2:** Improvements are planned

## Suggestions

* Conduct "Five-Whys" analysis for the topic
* Identify follow-up tasks in tickets
* Organize a call to conduct the 5 whys
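Regarding the auto-restart idea under question 3: openQA exposes REST routes to list and restart jobs, and the auto-review tooling in os-autoinst/scripts can already retry jobs whose logs match a known pattern. The sketch below is only an illustration of the idea, not the actual tooling: the instance URL and the unique log marker are made up, and API-key authentication (which the real API requires for POST requests; `openqa-cli` handles it) is omitted for brevity:

```python
import requests

OPENQA = "https://openqa.example.com"  # hypothetical instance
MARKER = "os-autoinst/pull/2342"  # assumed unique string in the error log


def restart_affected_jobs() -> None:
    """Restart failed jobs whose autoinst-log.txt contains MARKER."""
    jobs = requests.get(
        f"{OPENQA}/api/v1/jobs",
        params={"state": "done", "result": "failed"},
    ).json()["jobs"]
    for job in jobs:
        log = requests.get(f"{OPENQA}/tests/{job['id']}/file/autoinst-log.txt")
        if log.ok and MARKER in log.text:
            # POST /api/v1/jobs/<id>/restart re-schedules the job
            requests.post(f"{OPENQA}/api/v1/jobs/{job['id']}/restart")


if __name__ == "__main__":
    restart_affected_jobs()
```

This only works if the error message is unique and stable, which is another argument for including the PR link in the message itself.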