action #134723
Conduct lessons learned for making new-lines in script_run fatal size:S
Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-07-02
Due date:
2023-09-27
% Done:
0%
Estimated time:
Tags:
Description
Observation
The stricter handling of new-lines in os-autoinst resulted in some non-trivial test breakage. Apparently this hadn't been clear to everyone involved and may not have been communicated properly. What follows is a timeline from the change to the deployment of the revert. It should serve as the basis for a conversation and as an opportunity to gather feedback from different stakeholders.
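To make the change concrete, here is a minimal sketch (in Perl, like the os-autoinst testapi) of the kind of check that was introduced; the actual implementation in the PR may differ:

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for the testapi function; the real
# os-autoinst code differs, but the gist of the stricter handling is a
# check like this one:
sub script_run {
    my ($cmd) = @_;
    # Multi-line commands can confuse serial-terminal matching or silently
    # run only in part, so fail loudly instead of typing them verbatim.
    die "script_run does not support multi-line commands:\n$cmd\n" if $cmd =~ /\n/;
    print "would run: $cmd\n";    # placeholder for the real typing/matching logic
}

script_run("echo ok");             # fine
script_run("echo one\necho two");  # fatal under the stricter handling
```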
Timeline
- Original PR https://github.com/os-autoinst/os-autoinst/pull/2342
- Proposed on July 11
- Written warning in the PR: "This PR will make some bad code in our tests blow up. That is intentional."
- Marked with the "not ready" label
- August 22
- "not ready" label removed after additional reviews
- https://suse.slack.com/archives/C02CANHLANP/p1692715267069899
- August 24
- Related test failures were discovered and tracked down
- https://suse.slack.com/archives/C02CANHLANP/p1692864203095439
- https://suse.slack.com/archives/C02CANHLANP/p1692879684876059
- OBS downtime
- Revert https://github.com/os-autoinst/os-autoinst/pull/2355
- Fixes for the tests with new-lines were proposed
- August 25 https://suse.slack.com/archives/C02CANHLANP/p1692952660092829
Questions
- Why wasn't it clear to everyone that this was going to break some tests?
- The fail rate before and after was unclear
- Maybe the PR title and description were a bit short?
- It was stated that some tests would break
- Was it clearly coordinated e.g. in Slack for those not tracking the PR?
- Instead of a fatal error there could have been a log message? (see the sketch after this list)
- Maybe nobody would pay attention to such messages?
- It could have softfailed, although that is not how softfails should ideally be used, since they are about the product
- How about flagging certain log messages in the investigation tab or at the top of the job?
- A test setting like NEW_MULTI_LINE_CHECK could have made the check opt-in, so teams could enable it as they have capacity to look into failures (also covered in the sketch after this list)
- We can't necessarily coordinate a good time between all stakeholders?
- This should have been clearly announced in the release notes
- Merging the change and reverting it soon after had the effect of revealing all of the related issues
- Why was the merge not scheduled smarter?
- A lot of updates were in the queue at the time
- The move to Prague was still ongoing
- Why was the revert not deployed immediately?
- It was not clear to stakeholders that this was a testapi change, e.g. as opposed to a change in tests or hardware setups
- The change was expected to break things, so it shouldn't have been rolled back too fast; otherwise we wouldn't be seeing all results. That is why we are doing a bark test here, no?
- The error message should have linked to the PR for clarity.
- Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that? (a rough sketch follows this list)
- The details of automatic deployment weren't clear to everyone in the team; we should make sure they are in the future
- Maybe we can make it more discoverable whether the rollback was applied; some people were asking about it
- Why aren't the other teams seen as customers?
- "customers" could mean having API versions that can be enabled gradually
- Did we keep in mind external users in the community?
- We shall not break tests, and if we have to, we shall make it easy for other teams to decide for themselves when to fix them.
- We covered the 4 points we care about ;-)
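As a sketch of the non-fatal and opt-in alternatives raised above: the testapi functions get_var and record_info exist in os-autoinst, but the check itself, the NEW_MULTI_LINE_CHECK wiring, and the stubs below are illustrative assumptions:

```perl
use strict;
use warnings;

# Stubs so the sketch runs standalone; in a real test module these come
# from the os-autoinst testapi.
sub get_var     { $ENV{ $_[0] } }    # real get_var reads openQA job settings
sub record_info { print "INFO [$_[0]] $_[1]\n" }

sub check_multi_line {
    my ($cmd) = @_;
    return unless $cmd =~ /\n/;
    # Opt-in: only enforce the check when the (hypothetical) setting is on,
    # so teams can enable it once they have capacity to look into failures.
    die "script_run does not support multi-line commands:\n$cmd\n"
        if get_var('NEW_MULTI_LINE_CHECK');
    # Otherwise just surface the problem in the job's info boxes / log
    # instead of failing the test module outright.
    record_info('multi-line cmd', "script_run called with a multi-line command:\n$cmd");
}

check_multi_line("echo one\necho two");
```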
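And a rough sketch of the automatic-restart idea, assuming affected jobs could be identified via a unique error message; the jobs/<id>/restart API route used via openqa-cli is real, while the job IDs and the way they are collected are placeholders:

```perl
use strict;
use warnings;

# Hypothetical list of jobs whose logs matched the unique error string;
# in practice the IDs would come from a log search or from auto-review.
my @affected_job_ids = (4242, 4243);

for my $id (@affected_job_ids) {
    # POST jobs/<id>/restart is the openQA API route for retriggering a job.
    system('openqa-cli', 'api', '-X', 'POST', "jobs/$id/restart") == 0
        or warn "could not restart job $id\n";
}
```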
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys