
[epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S

Added by livdywan about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-12-29
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

See #105690

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why...?
    • ...
  2. Why...?
    • ...stricter and be more likely to link alerts to the smaller deployed changes
  3. Why...?
    • ...
  4. Why...?
    • ...
  5. Why...?
    • ...

Subtasks 7 (0 open, 7 closed)

action #106783: Cover manual testing steps, especially for "exotic" backends, in os-autoinst contribution hints size:M (Resolved, mkittler, 2022-02-14)

action #106867: os-autoinst: local svirt testing instructions size:M (Resolved, mkittler, 2022-02-14)

action #106996: os-autoinst: describe how to take out a production worker instance for testing backend changes size:M (Resolved, mkittler, 2022-02-14)

action #106999: os-autoinst: Document the use of custom openQA backend commands to test os-autoinst changes on production workers size:M (Resolved, okurz, 2022-02-14)

action #107005: Automatic review checklists on pull requests, especially for os-autoinst non-qemu backend tests (Resolved, okurz, 2022-02-14)

openQA Infrastructure - action #108401: Can't call method "id" on an undefined value at V1/Job.pm (Resolved, tinita, 2021-12-29)

action #108662: Can't call method "id" on an undefined value at WebAPI/Controller/Test.pm (Resolved, okurz, 2021-12-29)

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - action #105690: s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry (Resolved, okurz, 2022-01-28)
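As an aside: the auto_review label in the subject above is a regular expression that the team's auto-review tooling (in the os-autoinst/scripts repository) matches against the reason of incomplete jobs; a trailing ":retry" additionally triggers an automatic restart on match. A minimal sketch of the matching idea in Python, assuming a cleaned-up version of the regex and a made-up job reason string (the real matching logic lives in the auto-review scripts, not here):

    import re

    # Cleaned-up version of the label regex from the ticket subject
    # (the "." before "img" is escaped here; the original is a raw regex):
    label = r'unable to extract assets:.*/var/lib/libvirt/images/a\.img'

    # Hypothetical job reason string, for illustration only:
    reason = 'backend died: unable to extract assets: /var/lib/libvirt/images/a.img'

    if re.search(label, reason):
        print('job matches the ticket; ":retry" would also restart it automatically')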
Actions #1

Updated by livdywan about 2 years ago

  • Copied from action #105690: s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry added
Actions #2

Updated by okurz about 2 years ago

  • Priority changed from Normal to High
Actions #3

Updated by livdywan about 2 years ago

  • Blocks action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry added
Actions #4

Updated by mkittler about 2 years ago

This is likely supposed to be discussed in a call. However, here are two questions I've been asking myself:


Why didn't I block any of the problematic PRs by requesting changes requiring a manual verification run?

It seemed a bit too much to ask for, so I only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when making changes in important backends without automatic test coverage. I suppose there were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend).


Why was svirt development not so problematic before (or was it)?

I suppose not many people dared to touch svirt code in the past. It was done more by test maintainers themselves, who usually conduct manual verification runs. When I worked on the backend I tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though.

Actions #5

Updated by mkittler about 2 years ago

  • Assignee set to mkittler

One more question from my side to bring up in the retro:

  • Why haven't we been able to create a better automatic fullstack test for the svirt backend? (I'm not saying that the test must necessarily be executed in a CI context. Having an automated test which runs locally utilizing hardware accessible via the internal network would already be a huge improvement.)
Actions #6

Updated by livdywan about 2 years ago

Why did reaching out to experts not prevent this?

I made a point of reaching out to relevant maintainers before I even started looking into #104520, and keeping them in the loop. Arguably that gave me the confidence to even try. Except it doesn't look like it made much of a difference.

Actions #7

Updated by okurz about 2 years ago

One important factor for myself is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects, and for each I need to look for different things. E.g. in openQA I can usually trust the automated tests, assuming that corresponding tests are added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists". I am thinking of a bot like mergify that responds to pull requests with a review checklist that we can tweak per project, so that reviewers can cross off the individual points of that checklist.
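A minimal sketch of what such a checklist bot could look like, assuming Python with PyGithub and a GitHub access token; the repository names and checklist items below are illustrative placeholders, not an existing tool:

    from github import Github  # PyGithub

    # Per-project checklists that could be tweaked per repository
    # (illustrative items only):
    CHECKLISTS = {
        "os-autoinst/openQA": [
            "Automated tests added or updated",
            "Code coverage did not decrease",
        ],
        "os-autoinst/os-autoinst": [
            "Manual verification run conducted (or explicitly waived)",
            "Non-qemu backends considered, e.g. svirt/s390x",
        ],
    }

    def post_checklist(token: str, repo_name: str, pr_number: int) -> None:
        """Post the project's review checklist as a comment on a pull request."""
        items = CHECKLISTS.get(repo_name)
        if not items:
            return
        # GitHub renders "- [ ]" as checkboxes reviewers can cross off.
        body = "Review checklist:\n" + "\n".join(f"- [ ] {i}" for i in items)
        repo = Github(token).get_repo(repo_name)
        repo.get_pull(pr_number).create_issue_comment(body)

In practice something like this would be triggered from a webhook or CI job whenever a pull request is opened.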

Actions #8

Updated by livdywan about 2 years ago

  • Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" to 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #9

Updated by okurz about 2 years ago

  • Tracker changed from action to coordination
  • Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S to [epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
  • Status changed from Workable to In Progress

Five Whys

  1. Why did we start the work on the svirt backend in the first place?
  2. Why didn't we block any of the problematic PRs by requesting changes requiring a manual verification run?

    • It seemed a bit too much to ask for, so I only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when making changes in important backends without automatic test coverage. I suppose there were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend). We were also not sure what actually could be tested here.

    -> 2-1: Ask for manual verification on review
    => okurz: Added a note to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-DONE
    => #106783

    -> 2-2: Simplify the setup of "local svirt environment" and include it officially in os-autoinst (not just "Martchus" special)
    => #106867

    -> 2-3: Also describe how to take out a production worker for testing
    => #106996

    -> 2-4: Experiment again using custom openQA backend commands so that we can even use the os-autoinst version from a PR in a container
    => #106999

  3. Why was svirt development not so problematic before (or was it)?

    • Not many people dared to touch svirt code in the past. It was done more by test maintainers themselves, who usually conduct manual verification runs. When mkittler worked on the backend he tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though. We should improve our workflow.

    -> 3-1: Potentially be more conservative regarding changing existing code (but rather introduce separate code)

  4. Why haven't we been able to create a better automatic fullstack test for the svirt backend? (I'm not saying that the test must necessarily be executed in a CI context. Having an automated test which runs locally, utilizing hardware accessible via the internal network, would already be a huge improvement.)

    -> 4-1: It simply wasn't done, which does not need to stop us :) okurz suggests first achieving complete statement coverage, even though it's limited, and then extending system-level tests

  5. Why did we not ask experts to test this before merge?

    • Plans were reviewed, which is good. But this does not cover testing the implementation.

    -> 5-1: We should explicitly ask domain experts and previous contributors to review and test

    -> 5-2: One important factor for myself (okurz) is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects, and for each I need to look for different things. E.g. in openQA I can usually trust the automated tests, assuming that corresponding tests are added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists". I am thinking of a bot like mergify that responds to pull requests with a review checklist that we can tweak per project, so that reviewers can cross off the individual points of that checklist.
    => #107005

Actions #10

Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

@okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.

Actions #11

Updated by okurz about 2 years ago

mkittler wrote:

@okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.

If in doubt, all of the suggestions should go into follow-up tickets; at best, simple suggestions go only into #65271. If you can and want to, implement the suggestions directly, e.g. automatic checklists for review or something.

Actions #12

Updated by okurz about 2 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to okurz

As discussed in the daily on 2022-02-14, I will take over for the follow-up

Actions #13

Updated by openqa_review about 2 years ago

  • Due date set to 2022-03-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by okurz about 2 years ago

  • Category changed from Regressions/Crashes to Feature requests
  • Status changed from In Progress to Blocked

Created corresponding subtasks

Actions #15

Updated by okurz about 2 years ago

  • Parent task set to #109656
Actions #16

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved