action #134723: Conduct lessons learned for making new-lines in script_run fatal size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #134723

closed

Conduct lessons learned for making new-lines in script_run fatal size:S

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

livdywan

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-07-02

Due date:

2023-09-27

% Done:

Estimated time:

Tags:

lessons learned, five why

Description

Observation¶

The stricter handling of new-lines in os-autoinst resulted in some non-trivial breakage of tests. Apparently this hadn't been clear to everyone involved and may not have been communicated properly. As follows a timeline from the change towards the revert having been deployed. This should be the basis for conversation and an opportunity to gather feedback from different stakeholders.

Timeline¶

Original PR https://github.com/os-autoinst/os-autoinst/pull/2342
- Proposed on July 11
- Written warning in PR saying This PR will make some bad code in our tests blow up. That is intentional. and not ready label
August 22
- not ready label removed after additional reviews
- https://suse.slack.com/archives/C02CANHLANP/p1692715267069899
August 24
- Related test failures are being discovered and tracked down
- https://suse.slack.com/archives/C02CANHLANP/p1692864203095439
- https://suse.slack.com/archives/C02CANHLANP/p1692879684876059
- OBS downtime
- Revert https://github.com/os-autoinst/os-autoinst/pull/2355
- Fixes for the tests with new-lines were proposed
August 25 https://suse.slack.com/archives/C02CANHLANP/p1692952660092829

Questions¶

Why wasn't it clear to everyone that this was going to break some tests?
The fail rate before and after was unclear
- Maybe the PR title and description was a bit short?
- It was stated that some tests would break
- Was it clearly coordinated e.g. in Slack for those not tracking the PR?
- Instead of a fatal error there could have been a log message?
- Maybe nobody would pay attention to such messages?
- It could have softfailed... although that is not how softfails should ideally be used, since they're about the product
- How about flagging certain log messages in the investigation tab or at the top of the job
- A test setting like NEW_MULTI_LINE_CHECK could have been opt-in as teams are available to look into failures
- We can't necessarily coordinate a good time between all stakeholders?
- This should have been clearly announced in the release notes
- Merging and reverting the change soon after had the effect of revealing all of the related issues
Why was the merge not scheduled smarter?
- A lot of updates were in the queue at the time
- The move to Prague was still on-going
Why was the revert not deployed immediately?
- It was not clear to stakeholders that this was a testapi change e.g. as opposed to a change in tests or hardware setups
- This was expected to break things, so it shouldn't be rolled back too fast. Otherwise we're not seeing all results. This is why we're doing a bark test here, no?
- The error message should have linked to the PR for clarity.
- Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that?
- The details of automatic deployment weren't clear to everyone in the team, and we should make sure that is the case in the future
- Maybe we can make it more discoverable if the rollback was applied. There was some people asking about it
Why aren't the other teams seen as customers?
- "customers" could mean having API versions that can be enabled gradually
- Did we keep in mind external users in the community?
- We shall not break tests and if we have to, we shall make it easy for other teams to decide by themselves when to fix them.
We covered the 4 points we care about ;-)

Acceptance criteria¶

AC1: A Five-Whys analysis has been conducted and results documented
AC2: Improvements are planned

Suggestions¶

Conduct "Five-Whys" analysis for the topic
Identify follow-up tasks in tickets
Organize a call to conduct the 5 whys

Related issues 2 (2 open — 0 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #134723

Conduct lessons learned for making new-lines in script_run fatal size:S

Observation¶

Timeline¶

Questions¶

Acceptance criteria¶

Suggestions¶

Updated by mpagot over 1 year ago

Updated by bschmidt over 1 year ago

Updated by tinita over 1 year ago

Updated by mgrifalconi over 1 year ago

Updated by okurz over 1 year ago

Updated by MDoucha over 1 year ago

Updated by MDoucha over 1 year ago

Updated by acarvajal over 1 year ago

Updated by MDoucha over 1 year ago

Updated by favogt over 1 year ago

Updated by okurz over 1 year ago

Updated by mgrifalconi over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by MDoucha over 1 year ago

Updated by bschmidt over 1 year ago

Updated by bschmidt over 1 year ago

Updated by acarvajal over 1 year ago

Updated by okurz over 1 year ago

Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago