action #134723

closed

Conduct lessons learned for making new-lines in script_run fatal size:S

Added by livdywan 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-07-02
Due date: 2023-09-27
% Done: 0%
Estimated time:

Description

Observation

The stricter handling of new-lines in os-autoinst resulted in some non-trivial breakage of tests. Apparently this had not been clear to everyone involved and may not have been communicated properly. What follows is a timeline from the change being merged to the revert being deployed. This should serve as the basis for a conversation and an opportunity to gather feedback from different stakeholders.

Timeline

Questions

  1. Why wasn't it clear to everyone that this was going to break some tests?
  2. The fail rate before and after the change was unclear
    • Maybe the PR title and description were a bit short?
    • It was stated that some tests would break
    • Was it clearly coordinated e.g. in Slack for those not tracking the PR?
    • Instead of a fatal error there could have been a log message?
    • Maybe nobody would pay attention to such messages?
    • It could have softfailed... although that is not how softfails should ideally be used, since they are meant for product issues
    • How about flagging certain log messages in the investigation tab or at the top of the job?
    • A test setting like NEW_MULTI_LINE_CHECK could have made the check opt-in, so teams could enable it when they have capacity to look into failures (see the first sketch after this list)
    • We can't necessarily coordinate a good time between all stakeholders?
    • This should have been clearly announced in the release notes
    • Merging and reverting the change soon after had the effect of revealing all of the related issues
  3. Why was the merge not scheduled smarter?
    • A lot of updates were in the queue at the time
    • The move to Prague was still ongoing
  4. Why was the revert not deployed immediately?
    • It was not clear to stakeholders that this was a testapi change, e.g. as opposed to a change in tests or hardware setups
    • This was expected to break things, so it shouldn't be rolled back too fast; otherwise we wouldn't see all the results. This is why we're doing a bark test here, no?
    • The error message should have linked to the PR for clarity.
    • Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that? (see the second sketch after this list)
    • The details of automatic deployment weren't clear to everyone in the team, and we should make sure they are clear in the future
    • Maybe we can make it more discoverable whether the rollback was applied. There were some people asking about it
  5. Why aren't the other teams seen as customers?
    • "customers" could mean having API versions that can be enabled gradually
    • Did we keep in mind external users in the community?
    • We shall not break tests, and if we have to, we shall make it easy for other teams to decide by themselves when to fix them.
  6. We covered the 4 points we care about ;-)
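
To make the opt-in idea from point 2 more concrete, here is a minimal sketch (not the actual os-autoinst implementation) of how script_run could gate the stricter multi-line check behind a test setting such as NEW_MULTI_LINE_CHECK from the discussion above; script_run, get_var and record_info are existing testapi functions, while the overall structure and messages are purely illustrative:

    sub script_run {
        my ($cmd, @args) = @_;
        if (defined $cmd && $cmd =~ /\n/) {
            if (get_var('NEW_MULTI_LINE_CHECK')) {
                # opt-in behaviour: treat multi-line commands as a hard error
                die "script_run received a multi-line command:\n$cmd";
            }
            # default behaviour: only leave a visible note in the job results
            record_info('multi-line', "script_run received a multi-line command:\n$cmd");
        }
        # ... existing script_run behaviour continues unchanged ...
    }

With something like this, teams could enable NEW_MULTI_LINE_CHECK in their own job groups or test suites once they have capacity to look into failures, instead of everyone being affected at the same time.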
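
For the auto-restart idea from point 4, a minimal sketch of restarting affected jobs after the rollback, assuming their IDs have already been collected (e.g. by searching autoinst-log.txt for the unique error message); it relies only on the existing openqa-cli api command, and the host name and script name are made up:

    #!/usr/bin/perl
    # restart-affected.pl (hypothetical helper): restart the given openQA jobs,
    # e.g. perl restart-affected.pl 123456 123457
    use strict;
    use warnings;

    for my $id (@ARGV) {
        # POST jobs/<id>/restart via the openQA command line client
        system('openqa-cli', 'api', '--host', 'https://openqa.example.org',
            '-X', 'POST', "jobs/$id/restart") == 0
            or warn "failed to restart job $id\n";
    }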

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys

Related issues (2 open, 0 closed)

Copied to openQA Infrastructure - action #135662: Make the state of deployments (and rollbacks) more discoverable (New, 2023-07-02)

Copied to openQA Project - action #136127: Implement record warning feature similar to record softfail (New)
