action #134723

closed

Conduct lessons learned for making new-lines in script_run fatal size:S

Added by livdywan 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee: livdywan
Category: -
Target version: Ready
Start date: 2023-07-02
Due date: 2023-09-27
% Done: 0%
Estimated time:

Description

Observation

The stricter handling of new-lines in os-autoinst resulted in some non-trivial breakage of tests. Apparently this hadn't been clear to everyone involved and may not have been communicated properly. What follows is a timeline from the change to the deployment of the revert. This should be the basis for conversation and an opportunity to gather feedback from different stakeholders.
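
For illustration, the kind of call that became fatal looks roughly like this (a hypothetical, minimal example, not taken from an actual failing test):

    # Hypothetical example: the command string contains an embedded newline,
    # which the stricter script_run handling now rejects with a fatal error.
    script_run("echo 'first step'\necho 'second step'");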

Timeline

Questions

  1. Why wasn't it clear to everyone that this was going to break some tests?
  2. The fail rate before and after was unclear
    • Maybe the PR title and description were a bit short?
    • It was stated that some tests would break
    • Was it clearly coordinated e.g. in Slack for those not tracking the PR?
    • Instead of a fatal error there could have been a log message?
    • Maybe nobody would pay attention to such messages?
    • It could have softfailed... although that is not how softfails should ideally be used, since they're about the product
    • How about flagging certain log messages in the investigation tab or at the top of the job?
    • A test setting like NEW_MULTI_LINE_CHECK could have been made opt-in, so teams could enable it once they are available to look into failures
    • We can't necessarily coordinate a good time between all stakeholders?
    • This should have been clearly announced in the release notes
    • Merging and reverting the change soon after had the effect of revealing all of the related issues
  3. Why was the merge not scheduled smarter?
    • A lot of updates were in the queue at the time
    • The move to Prague was still ongoing
  4. Why was the revert not deployed immediately?
    • It was not clear to stakeholders that this was a testapi change e.g. as opposed to a change in tests or hardware setups
    • This was expected to break things, so it shouldn't be rolled back too fast. Otherwise we're not seeing all results. This is why we're doing a bark test here, no?
    • The error message should have linked to the PR for clarity.
    • Maybe by having a unique error message in the log we could have automatically restarted all affected jobs after the rollback? Maybe autoreview (with small changes?) could have been used for that?
    • The details of automatic deployment weren't clear to everyone in the team, and we should make sure they are clear in the future
    • Maybe we can make it more discoverable whether the rollback was applied. There were some people asking about it
  5. Why aren't the other teams seen as customers?
    • "customers" could mean having API versions that can be enabled gradually
    • Did we keep in mind external users in the community?
    • We shall not break tests and if we have to, we shall make it easy for other teams to decide by themselves when to fix them.
  6. We covered the 4 points we care about ;-)

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys

Related issues 2 (2 open, 0 closed)

Copied to openQA Infrastructure - action #135662: Make the state of deployments (and rollbacks) more discoverable (New, 2023-07-02)

Copied to openQA Project - action #136127: Implement record warning feature similar to record softfail (New)

Actions #1

Updated by mpagot 8 months ago

I'd like to suggest optional adoption of the new API. One possible approach that comes to my mind is:

if (get_var('NEW_MULTILINE_CHECK')) {
    die "Multiline commands are not supported:\n$cmd;\n" if $cmd =~ m/\n/;
}

So any team or JobGroup can opt in to the new code. Code like this can probably be merged without having to ask or wait for anyone. After some time the code can be changed to the more direct and "mandatory" form:

die "Multiline commands are not supported:\n$cmd;\n" if $cmd =~ m/\n/;
Actions #2

Updated by bschmidt 8 months ago

I would go further than what @mpagot suggests, and log the "would have been errors" to tackle them before going from optional to mandatory.

Actions #3

Updated by tinita 8 months ago

  • Target version set to Ready
Actions #4

Updated by mgrifalconi 8 months ago

Hello, from the perspective of that week's openQA reviewer, I feel like the question should not be

Why wasn't it clear to everyone that this was going to break some tests?

but instead

Why was there an intention to merge something that was going to break some tests, instead of either fixing the tests or asking test owners to fix them, to make sure anything that gets merged does not break?

I think I can be defined as an openQA stakeholder, being both a core team developer and an openQA reviewer, and I think (of course looking at my interests, but I feel those align with customers') that OSD shall be treated more as a production environment where 'keep it running' is the first and foremost priority.

One day of delay for an update release is more important than a 2-4 month delay in merging this PR, IMHO; open to being corrected as I am not familiar with the real impact of such a fix.

A bit of context:
We all know that openQA is one crucial tool to automate testing and approval of updates to our customers, we also know that the process (for various reasons) is not at its best and we happen to delay approvals for several days even when all tools work at their best.
Because of that, I don't like when we deliberately break things, distracting multiple people from multiple squads from working on what they planned to do and from working on problems that could not be easily avoided like the one we are talking about.

I also count as distracting devs the time they need to review an already prepared PR, not just the time spent fixing the real issue, due to the mindset 'You don't rush stuff in "production", not even fixes, if you can "easily" revert the problem instead.'

I understand that there was also a good dose of bad luck, as the easy revert was not easy in the end due to (AFAIK) having some experts in the topic on vacation (maybe a reason to delay the merge of non urgent breaking things?) and some datacenter/network work being done.
I also see that a super strict process to change stuff in openQA can be painful and in the end if no one takes action from a "well advertised warning" we need to break things to avoid being stuck forever.

But I am convinced we can meet somehow in the middle.

EDIT:
I really like the point of @mpagot ! Handling disruptive changes with an easy way out, especially if the revert is not that simple.
I think we had a similar discussion some months/years ago when the needle-detection algorithm was changed, causing the need to recreate hundreds of needles in "emergency mode".
I see it's a multi-step and more painful approach, but it would avoid such situations; something like:

  • deployed, disabled, OPT-IN feature flag
  • deployed, enabled, OPT-OUT feature flag
  • deployed, enabled, flag removed

With the possibility to avoid step 1 for simple situations.

I take the occasion to apologize if I sounded "bothered" by this event; I want to clarify that I have nothing against anyone involved, I am just passionate about the painful review process and (very) opinionated about the 'production' concept that I tried to describe in this comment.

I am happy to hear different opinions and also discuss that in a meeting if needed!

Actions #5

Updated by okurz 8 months ago

I agree with all points from mgrifalconi

@livdywan what I see happened is that you removed my review in https://github.com/os-autoinst/os-autoinst/pull/2342#event-10171498200 which I would not have done. I would not have approved the PR due to the "this will break stuff" notice that the pull request clearly notes and I would have at least asked more people to review the potential impact and give them time to prepare fixes upfront or propose alternatives. So my suggestion for the future: Be more careful about removing reviews from others. Or in more simple terms: Do not remove reviews from others!

Actions #6

Updated by MDoucha 8 months ago

mpagot wrote in #note-1:

So any team or JobGroup can OptIn to the new code. Code like this can be probably merged without to ask or wait anyone. After some time the code can be changed to more direct and "mandatory"

That would simply result in everybody ignoring the code change until it becomes mandatory and then get caught by surprise just as much.

bschmidt wrote in #note-2:

I would go further than what @mpagot suggests, and log the "would have been errors" to tackle them before going from optional to mandatory.

If there was an easy option to do that, I'd do that. Do you have a specific suggestion how to find script_run() calls with multiline commands other than making them break?

Note that @dimstar identified a total of 14 broken tests (+2 more identified by others) and fixes for all of them were submitted within 24 hours.

Actions #7

Updated by MDoucha 8 months ago

mgrifalconi wrote in #note-4:

A bit of context:
We all know that openQA is one crucial tool to automate testing and approval of updates to our customers, we also know that the process (for various reasons) is not at its best and we happen to delay approvals for several days even when all tools work at their best.
Because of that, I don't like when we deliberately break things, distracting multiple people from multiple squads from working on what they planned to do and from working on problems that could not be easily avoided like the one we are talking about.

I also count as distracting devs the time they need to review an already prepared PR, not just the time spent fixing the real issue, due to the mindset 'You don't rush stuff in "production", not even fixes, if you can "easily" revert the problem instead.'

I understand that there was also a good dose of bad luck, as the easy revert was not easy in the end due to (AFAIK) having some experts in the topic on vacation (maybe a reason to delay the merge of non urgent breaking things?) and some datacenter/network work being done.
I also see that a super strict process to change stuff in openQA can be painful and in the end if no one takes action from a "well advertised warning" we need to break things to avoid being stuck forever.

But I am convinced we can meet somehow in the middle.

Let me also add some context that my PR was not just a change for the sake of change. We've been slowly transitioning tests from the VNC console to the serial terminal because it's an order of magnitude faster and more reliable. But multiline commands don't work properly on the serial terminal because each newline causes the shell to print extra characters back, and the check whether the command was echoed correctly will fail. It's a really confusing failure when you see it for the first time, so most of our colleagues will waste many hours trying to figure out why it's suddenly failing. And let's just say that I've seen enough wrong attempts to fix those problems that would actually make things worse, so I decided to eliminate this problem once and for all.
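
As a sketch of the kind of rewrite this enforces (hypothetical commands, not taken from an actual test), a multiline command can usually be joined onto one line or split into separate calls:

    # Instead of one command string with embedded newlines, e.g.
    #   script_run("zypper -n in foo\nsystemctl enable --now foo");
    # join the steps on one line or issue separate calls:
    script_run("zypper -n in foo && systemctl enable --now foo");
    script_run("systemctl status foo");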

Actions #8

Updated by acarvajal 8 months ago

MDoucha wrote in #note-6:

If there was an easy option to do that, I'd do that. Do you have a specific suggestion how to find script_run() calls with multiline commands other than making them break?

record_soft_fail "poo#12345 - Multiline commands are not supported:\n$cmd;\n"; instead of die, leaving it run for some days and then querying the results?

Usually in packages we submit to our customers, whenever we're dropping an argument or some command, we keep it for some time with a message pointing out that the argument is soon to be deprecated. Why not apply the same to our test code?

Actions #9

Updated by MDoucha 8 months ago

okurz wrote in #note-5:

I would not have approved the PR due to the "this will break stuff" notice that the pull request clearly notes and I would have at least asked more people to review the potential impact and give them time to prepare fixes upfront or propose alternatives.

My PR was open for 43 days. Why did you not ask more people to review the impact during that time instead of asking for:

  • removing a blank line
  • changing a timeout constant to a parameter
    • bumping the API version, which would not have prevented any test failures on its own
Actions #10

Updated by favogt 8 months ago

I'll argue what I've argued in previous cases like this before: Making such checks opt-in will just not work. Opt-out is the only way to address it.

In this particular case I think it would've been possible to scan the entire testsuite for cases where a newline is in a string literal. That should detect the vast majority of cases.
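
A rough sketch of such a scan (an assumed approach, not existing tooling; it only catches single-line calls containing a literal \n escape):

    # scan_multiline.pl - crude heuristic: flag lines that call
    # script_run()/script_output() and contain a literal "\n" escape.
    use strict;
    use warnings;
    use File::Find;

    find(sub {
        return unless -f && /\.pm$/;
        open my $fh, '<', $_ or return;
        while (my $line = <$fh>) {
            print "$File::Find::name:$.: $line" if $line =~ /script_(?:run|output)\(.*\\n/;
        }
        close $fh;
    }, '.');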

Actions #11

Updated by okurz 8 months ago

MDoucha wrote in #note-9:

okurz wrote in #note-5:

I would not have approved the PR due to the "this will break stuff" notice that the pull request clearly notes and I would have at least asked more people to review the potential impact and give them time to prepare fixes upfront or propose alternatives.

My PR was open for 43 days. Why did you not ask more people to review the impact during that time instead of asking for [...]

I only ever commented on the PR in the first two days after it was opened, while it had the "not-ready" label. I did not look into it in the following weeks. If I had stumbled upon the PR again I might have observed that not enough test-responsible people had seen the pull request or actively commented on it. However, in general I consider 43 days enough. Everybody had their chance to take a look at the pull request. If they don't, well, that happens :)

favogt wrote in #note-10:

I'll argue what I've argued in previous cases like this before: Making such checks opt-in will just not work. Opt-out is the only way to address it.

I agree. Still, offering the multiple steps that Michael has proposed would not give anyone a reason to complain that they didn't have a chance ;)

acarvajal wrote in #note-8:

MDoucha wrote in #note-6:

If there was an easy option to do that, I'd do that. Do you have a specific suggestion how to find script_run() calls with multiline commands other than making them break?

record_soft_fail "poo#12345 - Multiline commands are not supported:\n$cmd;\n"; instead of die, leaving it run for some days and then querying the results?

Usually in packages we submit to our customers, whenever we're dropping an argument or some command, we keep it for some time with a message pointing out that the argument is soon to be deprecated. Why not apply the same to our test code?

No. record_soft_fail should only be used for product issues. You can use record_info or write messages to the logfile instead.

Actions #12

Updated by mgrifalconi 8 months ago

MDoucha wrote in #note-7:

Let me also add some context that my PR was not just a change for the sake of change. [...]

Good morning, I never meant to question the validity or reasoning behind your change and I apologize if it sounded otherwise. All my observations are about the impact, the deployment and its urgency.
I understand that sometimes we need big changes to improve things we might take for granted, and we need experts like you to develop the best possible solution to move forward, even if it requires some additional work by multiple people to have a better service in the long run!

I still maintain that disrupting multiple squads and delaying update approval (at a specific time when the situation was painful already) is not a decision for a single individual or a few individuals to take. It could have been a valid point of discussion in the POs' QE-Sync meeting on Wednesdays, where all POs would agree on the urgency and would allocate resources accordingly.

I see a few comments about "making people aware of changes". I'd like to expand on that, since we might lack common ground on what "making all stakeholders aware" means; this is my view:
I would not expect all participants/stakeholders of QE to scan os-autoinst, openQA and os-autoinst-distri-opensuse all the time, not even every 6 months. It feels like a waste of time for dozens of people to read and understand changes that most of the time do not impact their work.
We have regular meetings to sync and talk about common topics, and possibly the qe-sync meeting is a good choice.

This is part of a bigger topic that I started to mention a couple of years ago about ownership and areas of responsibility, where it should be well defined who (which squad/team) is responsible for a specific asset (code, server, service).
Then the owner shall be able to know who depends on its products, to be able to communicate breaking changes or outages (direct communication, ping to POs, meetings = TCP; not a broadcast into the wind like a PR = UDP).
I see in the meantime we did move towards that direction by:

  • using squads instead of individuals in the test maintainer description
  • moving job groups/test suites to git to find out about changes on openQA

That is just my personal view and again I am open to discussing it and trying to elaborate it better, maybe in a different ticket, as I fear we might go too far off topic.

Actions #13

Updated by livdywan 8 months ago

  • Status changed from New to Feedback
  • Assignee set to livdywan

That is just my personal view and again I am open to discussing it and trying to elaborate it better, maybe in a different ticket, as I fear we might go too far off topic.

Ack. This right here is to collect feedback. We should have follow-up tickets for such ideas rather than trying to "solve it" here.

Let's also discuss this at the workshop on September 8. I'm adding it to the list of topics now.

Actions #14

Updated by livdywan 8 months ago

  • Subject changed from Conduct lessons learned for making new-lines in script_run fatal to Conduct lessons learned for making new-lines in script_run fatal size:S
Actions #15

Updated by livdywan 8 months ago

  • Description updated (diff)

The workshop went really well I think. Questions were refined and hypotheses as well as ideas for how we could've done better have been transferred to the ticket description.

I'm not resolving the ticket just yet. I'd like to give everyone a little more time to chime in, also because some people were not able to attend. In the meantime I'm going to look into follow-up tickets which we can explore.

Actions #16

Updated by livdywan 8 months ago

By the way the recording is available internally. We opted to err on the safe side since we were very focussed on the internal SUSE perspectives.
^ If anyone has feedback from other communities outside of SUSE, please do feel free to share. We actually talked about communicating this better in release notes.

Actions #17

Updated by livdywan 8 months ago

Trying to come up with a little best practice guide on handling cases like this in the future. Feedback welcome:

When proposing non-trivial changes with the potential of breaking existing tests, consider the following best-practice patterns (a small sketch of the opt-in pattern follows the list):

  • Make the problematic change opt-in via a test variable like MY_NEW_FEATURE_ENABLED to enable the new behavior, and otherwise log a warning only
  • If a BARK test is to be conducted to assess the full impact of the change an autoreview regex matching the most relevant error message should be prepared so that affected jobs can be restarted trivially without disrupting daily operation too much.
  • Inform all stakeholders in relevant Slack channels, Matrix and mailing lists
  • Include an explicit mention in the release notes
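
A minimal sketch of the opt-in pattern (the variable name and the exact messages are placeholders, in the spirit of the suggestion in #note-1):

    # Sketch only: MY_NEW_FEATURE_ENABLED is a placeholder setting name.
    if (get_var('MY_NEW_FEATURE_ENABLED')) {
        die "Multiline commands are not supported:\n$cmd;\n" if $cmd =~ m/\n/;
    }
    elsif ($cmd =~ m/\n/) {
        record_info('Deprecation', "Multiline command will become fatal once the check is mandatory:\n$cmd");
    }
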
Actions #18

Updated by livdywan 8 months ago

  • Copied to action #135662: Make the state of deployments (and rollbacks) more discoverable added
Actions #19

Updated by MDoucha 8 months ago

One more idea for the future: The main problem with less fatal approaches to marking the buggy tests was that nobody would notice a log message or even a softfail. OpenQA lacks some kind of clearly visible non-fatal warning. Let's add a new function like record_code_warning() that'll behave similarly to record_soft_fail(). It'll add a result box to the test module with a warning icon ⚠️ and a pop-up description.

The result box will have no effect on module/job result value but it'll set a special "warning" flag on the module and job. Test modules with the warning flag will stay visible if you click "Show only failures" on the job detail page (even if the module passed). Jobs with the warning flag will have the warning icon on the test overview page right next to the usual BSC and POO icons.
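
A hypothetical usage sketch of the proposed function (record_code_warning() does not exist in testapi yet; the name and semantics are taken from the suggestion above):

    # Hypothetical: record a visible, non-fatal warning without failing the module.
    record_code_warning("poo#134723: multiline command passed to script_run:\n$cmd")
        if $cmd =~ m/\n/;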

Actions #20

Updated by bschmidt 8 months ago

livdywan wrote in #note-17:

  • Make the problematic change opt-in via a test variable like MY_NEW_FEATURE_ENABLED to enable the new behavior, and otherwise log a warning only
  • If a BARK test is to be conducted to assess the full impact of the change an autoreview regex matching the most relevant error message should be prepared so that affected jobs can be restarted trivially without disrupting daily operation too much.

I'd like the message to include a link to the corresponding GitHub issue or PR.
That would add all the context for whoever sees the error message.

Actions #21

Updated by bschmidt 8 months ago

MDoucha wrote in #note-19:

One more idea for the future: The main problem with less fatal approaches to marking the buggy tests was that nobody would notice a log message or even a softfail. OpenQA lacks some kind of clearly visible non-fatal warning. Let's add a new function like record_code_warning() that'll behave similarly to record_soft_fail(). It'll add a result box to the test module with a warning icon ⚠️ and a pop-up description.

The result box will have no effect on module/job result value but it'll set a special "warning" flag on the module and job. Test modules with the warning flag will stay visible if you click "Show only failures" on the job detail page (even if the module passed). Jobs with the warning flag will have the warning icon on the test overview page right next to the usual BSC and POO icons.

I really like this as well!!!
Code as documentation is always better than communication via release notes or any other channel.

Actions #22

Updated by acarvajal 8 months ago

MDoucha wrote in #note-19:

One more idea for the future: The main problem with less fatal approaches to marking the buggy tests was that nobody would notice a log message or even a softfail. OpenQA lacks some kind of clearly visible non-fatal warning. Let's add a new function like record_code_warning() that'll behave similarly to record_soft_fail(). It'll add a result box to the test module with a warning icon ⚠️ and a pop-up description.

The result box will have no effect on module/job result value but it'll set a special "warning" flag on the module and job. Test modules with the warning flag will stay visible if you click "Show only failures" on the job detail page (even if the module passed). Jobs with the warning flag will have the warning icon on the test overview page right next to the usual BSC and POO icons.

I like this, great idea! If this can be made to also leave a result that's easy to query, better yet.

Actions #23

Updated by okurz 7 months ago

  • Due date set to 2023-09-27
  • Status changed from Feedback to Workable

There is enough feedback by multiple people, please incorporate the ideas into #65271 – or separate tickets if you must :)

Actions #24

Updated by livdywan 7 months ago

  • Copied to action #136127: Implement record warning feature similar to record softfail added
Actions #25

Updated by livdywan 7 months ago

  • Status changed from Workable to Resolved

See https://progress.opensuse.org/projects/qa/wiki/Tools#Best-practices-for-major-changes

With this and the feature request based on Martin's suggestion I consider this resolved. Thank you everyone for your feedback!
