action #134723

Updated by livdywan 11 months ago

## Observation 

 The stricter handling of new-lines in os-autoinst resulted in some non-trivial breakage of tests. Apparently this hadn't been clear to everyone involved and may not have been communicated properly. As follows a timeline from the change towards the revert having been deployed. This should be the basis for conversation and an opportunity to gather feedback from different stakeholders. 

 ### Timeline 

 - Original PR 
   - Proposed on July 11 
   - Written warning in PR saying `This PR will make some bad code in our tests blow up. That is intentional.` and *not ready* label 
 - August 22 
   - *not ready* label removed after additional reviews 
 - August 24 
   - Related test failures are being discovered and tracked down 
   - OBS downtime 
   - Revert 
   - Fixes for the tests with new-lines were proposed 
 - August 25 

 ### Questions to be asked here might be: 

 1. Why wasn't it clear to everyone that this was going to break some tests? 
 - The fail rate before and after was unclear 
   - Maybe the PR title and description was a bit short? 
   - It was stated that some tests would break 
   - Was it clearly coordinated e.g. in Slack for those not tracking the PR? 
   - Instead of a fatal error there could have been a log message? 
     - Maybe nobody would pay attention to such messages? 
     - It could have softfailed... although that is not how softfails should ideally be used, since they're about the product 
     - How about flagging certain log messages in the investigation tab or at the top of the job 
     - A test setting like NEW_MULTI_LINE_CHECK could have been opt-in as teams are available to look into failures 
     - We can't necessarily coordinate a good time between all stakeholders? 
     - This should have been clearly announced in the release notes 
   - Merging and reverting the change soon after had the effect of revealing all of the related issues 
 2. Why was the merge not scheduled smarter? 
   - A lot of updates were in the queue at the time 
   - The move to Prague was still on-going 
 3. Why was the revert not deployed immediately? 
   - It was not clear to stakeholders that this was a testapi change e.g. as opposed to a change in tests or hardware setups 
   - This was expected to break things, so it shouldn't be rolled back too fast. Otherwise we're not seeing all results. This is why we're doing a bark test here, no? 
   - The error message should have linked to the PR for clarity. 
   - Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that? 
   - The details of automatic deployment weren't clear to everyone in the team, and we should make sure that is the case in the future 
   - Maybe we can make it more discoverable if the rollback was applied. There was some people asking about it 
 4. Why aren't the other teams seen as customers? 
   - "customers" could mean having API versions that can be enabled gradually 
   - Did we keep in mind external users in the community? 
   - We shall not break tests and if we have to, we shall make it easy for other teams to decide by themselves when to fix them. 
 5. We covered the 4 points we care about ;-) 


 ## Acceptance criteria 
 * **AC1:** A [Five-Whys]( analysis has been conducted and results documented 
 * **AC2:** Improvements are planned 

 ## Suggestions 
 * Conduct "Five-Whys" analysis for the topic 
 * Identify follow-up tasks in tickets 
 * Organize a call to conduct the 5 whys