coordination #105699
[epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
Description
Observation
See #105690
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Five Whys
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
- Why...?
- ...
Updated by livdywan almost 3 years ago
- Copied from action #105690: s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry added
Updated by livdywan almost 3 years ago
- Blocks action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry added
Updated by mkittler almost 3 years ago
This is likely supposed to be discussed in a call. However, here are two questions I've been asking myself:
Why didn't I block any of the problematic PRs by requesting changes requiring a manual verification run?
It seemed a bit too much to ask for, so I only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when doing changes in important backends without automatic test coverage. I suppose there were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend).
Why was svirt development not so problematic before (or was it)?
I suppose not many people dared to touch svirt code in the past. It was done more by the test maintainers themselves, who usually conduct manual verification runs. When I worked on the backend I tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though.
Updated by mkittler almost 3 years ago
- Assignee set to mkittler
One more question from my side to bring up in the retro:
- Why haven't we been able to create a better automatic fullstack test for the svirt backend? (I'm not saying that the test must necessarily be executed in a CI context. Having an automated test which runs locally utilizing hardware accessible via the internal network would already be a huge improvement.)
Updated by livdywan almost 3 years ago
Why did reaching out to experts not prevent this?
I made a point of reaching out to relevant maintainers before I even started looking into #104520, and keeping them in the loop. Arguably that gave me the confidence to even try. Except it doesn't look like it made much of a difference.
Updated by okurz almost 3 years ago
One important factor for myself is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects and each one requires looking for different things: in openQA I can usually trust the automated tests, assuming corresponding tests were added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists". I am thinking of a bot like mergify that responds to pull requests with a review checklist which we can tweak per project, so that reviewers can cross off the individual points of that checklist.
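To illustrate the checklist-bot idea: a minimal sketch of a script that posts a per-project review checklist as a task-list comment on a pull request, assuming it is triggered from CI or a webhook handler and authenticated via a GitHub token. The repository names, the checklist items and the GITHUB_TOKEN environment variable are illustrative assumptions, not an existing tool's configuration.

```python
#!/usr/bin/env python3
"""Sketch of a review-checklist bot: post a per-project checklist on a pull request.

Assumes it is invoked (e.g. from CI or a webhook handler) with "owner/repo" and the
PR number; the checklists and the GITHUB_TOKEN variable are illustrative only.
"""
import os
import sys

import requests

# Checklists tweakable per project, as suggested above (items are examples).
CHECKLISTS = {
    "os-autoinst/openQA": [
        "Automated tests added and code coverage is fine",
        "Documentation updated where needed",
    ],
    "os-autoinst/os-autoinst": [
        "Unit tests added or extended",
        "Manual verification run linked for backend changes without automatic test coverage",
    ],
}


def post_checklist(repo: str, pr_number: int, token: str) -> None:
    """Post the repository's checklist as a task-list comment on the pull request."""
    items = CHECKLISTS.get(repo)
    if not items:
        return  # no checklist configured for this repository
    body = "Review checklist:\n" + "\n".join(f"- [ ] {item}" for item in items)
    # Pull request comments go through GitHub's issue-comment endpoint.
    response = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"token {token}", "Accept": "application/vnd.github+json"},
        json={"body": body},
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    # usage: checklist_bot.py <owner/repo> <pr-number>
    post_checklist(sys.argv[1], int(sys.argv[2]), os.environ["GITHUB_TOKEN"])
```

A hosted bot such as mergify could likely achieve the same with a per-repository rule that posts a comment; the sketch only shows the shape of the idea, with the checklist text kept tweakable per project while reviewers tick off the items on the PR.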
Updated by livdywan almost 3 years ago
- Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" to 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 3 years ago
- Tracker changed from action to coordination
- Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S to [epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
- Status changed from Workable to In Progress
Five Whys
- Why did we start the work on the svirt backend in the first place?
- The backend was never capable of fully executing tests without relying on some specific code in os-autoinst-distri-opensuse, which we realized while working on another fix and feature on the os-autoinst side. There was a contribution, https://github.com/os-autoinst/os-autoinst/pull/1741, which had an unexpected impact on the svirt backend and caused tests to fail; we only realized this after merging, so the PR was reverted. This is when we started the work on #104520, which is still open, as is https://github.com/os-autoinst/os-autoinst/pull/1898
- Why didn't we block any of the problematic PRs by requesting changes requiring a manual verification run?
- It seemed a bit too much to ask for, so mkittler only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when doing changes in important backends without automatic test coverage. There were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend). We were also not sure what actually could be tested here
-> 2-1: Ask for manual verification on review
=> okurz: Added a note to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-DONE
=> #106783
-> 2-2: Simplify the setup of "local svirt environment" and include it officially in os-autoinst (not just "Martchus" special)
=> #106867
-> 2-3: Also describe how to take out a production worker for testing
=> #106996
-> 2-4: Experiment again using custom openQA backend commands so that we can even use the os-autoinst version from a PR in a container
=> #106999
- Why was svirt development not so problematic before (or was it)?
- Not many people dared to touch svirt code in the past. It was done more by the test maintainers themselves, who usually conduct manual verification runs. When mkittler worked on the backend he tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though. We should improve our workflow
-> 3-1: Potentially be more conservative regarding changing existing code (but rather introduce separate code)
- Why haven't we been able to create a better automatic fullstack test for the svirt backend? (The test would not necessarily need to be executed in a CI context; an automated test which runs locally, utilizing hardware accessible via the internal network, would already be a huge improvement.)
-> 4-1: It simply wasn't done, which does not need to stop us :) okurz suggests first reaching complete statement coverage, even though that is limited, and then extending system-level tests
- Why did we not ask experts to test this before merging?
- Plans were reviewed, which is good. But this does not cover testing the implementation.
-> 5-1: We should explicitly ask domain experts and previous contributors to review and test
-> 5-2: One important factor for myself (okurz) is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects and each one requires looking for different things: in openQA I can usually trust the automated tests, assuming corresponding tests were added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists": a bot like mergify that responds to pull requests with a review checklist which we can tweak per project, so that reviewers can cross off the individual points of that checklist.
=> #107005
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Feedback
@okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.
Updated by okurz almost 3 years ago
mkittler wrote:
@okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.
If in doubt, all of the suggestions should go into follow-up tickets; at best only simple suggestions go into #65271. If you can and want to, implement the suggestions directly, e.g. automatic checklists for review.
Updated by okurz almost 3 years ago
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to okurz
As discussed in the daily on 2022-02-14, I will take over for the follow-up
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz almost 3 years ago
- Category changed from Regressions/Crashes to Feature requests
- Status changed from In Progress to Blocked
Created the corresponding subtasks