
coordination #105699

coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

coordination #109656: [epic] Stable non-qemu backends

[epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S

Added by cdywan 4 months ago. Updated 13 days ago.

Status:
Blocked
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-12-29
Due date:
% Done:

75%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Observation

See #105690

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Five Whys

  1. Why...?
    • ...
  2. Why...?
    • ...stricter and be more likely to link alerts to the smaller deployed changes
  3. Why...?
    • ...
  4. Why...?
    • ...
  5. Why...?
    • ...

Subtasks

action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry (Workable)

action #106783: Cover manual testing steps, especially for "exotic" backends, in os-autoinst contribution hints size:M (Workable)

action #106867: os-autoinst: local svirt testing instructions size:M (Resolved: mkittler)

action #106996: os-autoinst: describe how to take out a production worker instance for testing backend changes size:M (Resolved: mkittler)

action #106999: os-autoinst: Document the use of custom openQA backend commands to test os-autoinst changes on production workers size:M (Resolved: okurz)

action #107005: Automatic review checklists on pull requests, especially for os-autoinst non-qemu backend tests (Resolved: okurz)

openQA Infrastructure - action #108401: Can't call method "id" on an undefined value at V1/Job.pm (Resolved: tinita)

action #108662: Can't call method "id" on an undefined value at WebAPI/Controller/Test.pm (Resolved: okurz)


Related issues

Copied from openQA Project - action #105690: s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry (Resolved: 2022-01-28)

History

#1 Updated by cdywan 4 months ago

  • Copied from action #105690: s390x svirt jobs incomplete with auto_review:"unable to extract assets:.*/var/lib/libvirt/images/a.img":retry added

#2 Updated by okurz 4 months ago

  • Priority changed from Normal to High

#3 Updated by cdywan 4 months ago

  • Blocks action #104520: Move svirt extract_asset code from os-autoinst-distri-opensuse to os-autoinst/backend/svirt.pm size:M auto_review:"unable to extract assets: Can't call method.+name.+on an undefined value":retry added

#4 Updated by mkittler 3 months ago

This is likely supposed to be discussed in a call. However, here are two questions I've been asking myself:


Why didn't I block any of the problematic PRs by requesting changes requiring a manual verification run?

It seemed a bit too much to ask for, so I only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when making changes to important backends without automatic test coverage. I suppose there were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend).


Why was svirt development not so problematic before (or was it)?

I suppose not many people dared to touch svirt code in the past. It was done more by the test maintainers themselves, who usually conduct manual verification runs. When I worked on the backend I tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though.

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler

One more question from my side to bring up in the retro:

  • Why haven't we been able to create a better automatic fullstack test for the svirt backend? (I'm not saying that the test must necessarily be executed in a CI context. Having an automated test which runs locally utilizing hardware accessible via the internal network would already be a huge improvement.)

#6 Updated by cdywan 3 months ago

Why did reaching out to experts not prevent this?

I made a point of reaching out to relevant maintainers before I even started looking into #104520, and keeping them in the loop. Arguably that gave me the confidence to even try. Except it doesn't look like it made much of a difference.

#7 Updated by okurz 3 months ago

One important factor for myself is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects, and for each I need to look for different things: in openQA I can usually trust the automated tests, assuming corresponding tests are added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists". I am thinking of a bot like Mergify that responds to pull requests with a review checklist that we can tweak per project, and then reviewers can cross off the individual points of that checklist.
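The per-project checklist idea above could be sketched as follows. This is a hypothetical helper, not an existing bot: the project names and checklist items are illustrative placeholders. It only builds the Markdown comment body that such a bot would post on a pull request.

```python
# Hypothetical sketch of a per-project review checklist builder.
# Project names and checklist items below are illustrative only.

CHECKLISTS = {
    "openQA": [
        "Automated tests added or updated for the change",
        "Code coverage did not drop",
    ],
    "os-autoinst": [
        "Manual verification run performed (or explicitly waived)",
        "Unit tests added where the backend code allows it",
    ],
    "openQA test distributions": [
        "Link to a passing openQA verification job included",
    ],
}


def review_checklist(project: str) -> str:
    """Build the Markdown checklist comment a review bot could post on a PR."""
    items = CHECKLISTS.get(project, ["No checklist defined; use reviewer judgment"])
    lines = [f"Review checklist for {project}:"]
    lines += [f"- [ ] {item}" for item in items]
    return "\n".join(lines)
```

A bot could post `review_checklist("os-autoinst")` as a comment on each new pull request, and reviewers would then tick off the boxes before approving.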

#8 Updated by cdywan 3 months ago

  • Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" to 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
  • Description updated (diff)
  • Status changed from New to Workable

#9 Updated by okurz 3 months ago

  • Tracker changed from action to coordination
  • Subject changed from 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S to [epic] 5 whys follow-up to s390x svirt jobs incomplete with unable to extract assets:.*/var/lib/libvirt/images/a.img" size:S
  • Status changed from Workable to In Progress

Five Whys

  1. Why did we start the work on the svirt backend in the first place?
  2. Why didn't we block any of the problematic PRs by requesting changes requiring a manual verification run?

    • It seemed a bit too much to ask for, so I only suggested having a look at https://github.com/Martchus/openQA-helper#testrun-svirt-backend-locally on the 2nd PR. Maybe we should be stricter when making changes to important backends without automatic test coverage. I suppose there were even unit tests, but they don't seem to be enough (that's why we have the fullstack test, but it only covers the QEMU backend). We were also not sure what actually could be tested here.

    -> 2-1: Ask for manual verification on review
    => okurz: Added a note to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-DONE
    => #106783

    -> 2-2: Simplify the setup of "local svirt environment" and include it officially in os-autoinst (not just "Martchus" special)
    => #106867

    -> 2-3: Also describe how to take out a production worker for testing
    => #106996

    -> 2-4: Experiment again using custom openQA backend commands so that we can even use the os-autoinst version from a PR in a container
    => #106999

  3. Why was svirt development not so problematic before (or was it)?

    • Not many people dared to touch svirt code in the past. It was done more by the test maintainers themselves, who usually conduct manual verification runs. When mkittler worked on the backend he tried to do the same, sometimes getting help from test maintainers. It was sometimes a lot of effort, though. We should improve our workflow.

    -> 3-1: Potentially be more conservative regarding changing existing code (but rather introduce separate code)

  4. Why haven't we been able to create a better automatic fullstack test for the svirt backend? (I'm not saying that the test must necessarily be executed in a CI context. Having an automated test which runs locally, utilizing hardware accessible via the internal network, would already be a huge improvement.)

    -> 4-1: It simply wasn't done, which does not need to stop us :) okurz suggests first achieving complete statement coverage, even though it's limited, and then extending system-level tests

  5. Why did we not ask experts to test this before merge?

    • Plans were reviewed, which is good. But this does not cover testing the implementation.

    -> 5-1: We should explicitly ask domain experts and previous contributors to review and test

    -> 5-2: One important factor for myself (okurz) is that I keep forgetting or overlooking certain things in reviews because I review in multiple different projects, and for each I need to look for different things: in openQA I can usually trust the automated tests, assuming corresponding tests are added and code coverage is fine; for os-autoinst I would need to carefully try things myself or ask for tests; for openQA test distributions I would need to look for corresponding openQA verification jobs. I suggest we look into "review checklists". I am thinking of a bot like Mergify that responds to pull requests with a review checklist that we can tweak per project, and then reviewers can cross off the individual points of that checklist.
    => #107005
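For suggestion 5-2, a minimal sketch of what such a Mergify rule could look like (placed in `.mergify.yml`; the branch condition and checklist items are placeholders, not a vetted configuration):

```yaml
pull_request_rules:
  - name: post per-project review checklist
    conditions:
      - base=master
    actions:
      comment:
        message: |
          Review checklist (tick before approving):
          - [ ] tests added or updated
          - [ ] manual verification run linked (for non-qemu backend changes)
```

The checklist text would be tweaked per repository, so e.g. the os-autoinst rule could require a linked verification run while the openQA rule only asks about test coverage.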

#10 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.

#11 Updated by okurz 3 months ago

mkittler wrote:

okurz Thanks for summarizing the results. Is there anything left to do as part of this ticket? There are many suggestions but I'm not sure which ones should go into a follow-up ticket.

If in doubt, all of the suggestions should go into follow-up tickets; at best, simple suggestions go only into #65271. If you can and want to, implement the suggestions directly, e.g. automatic checklists for review or something.

#12 Updated by okurz 3 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to okurz

As discussed in the daily on 2022-02-14 I will take over for the follow-up

#13 Updated by openqa_review 3 months ago

  • Due date set to 2022-03-01

Setting due date based on mean cycle time of SUSE QE Tools

#14 Updated by okurz 3 months ago

  • Category changed from Concrete Bugs to Feature requests
  • Status changed from In Progress to Blocked

Created corresponding subtasks

#15 Updated by okurz about 1 month ago

  • Parent task set to #109656
