action #139136

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Organisational
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

In #136130 the work was delayed multiple times and the due date had already been bumped 4 times as of 2023-11-05. We should look into our execution of that ticket, learn what happened and find improvements for the future.

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

Related issues: 2 (0 open, 2 closed)

Copied from openQA Tests - action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M (Resolved, mkittler, 2023-09-20)

Copied to openQA Infrastructure - action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M (Resolved, nicksinger, 2023-12-05)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Actions #2

Updated by okurz 6 months ago

  • Subject changed from Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" to Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 6 months ago

  • Priority changed from Normal to High
Actions #4

Updated by okurz 5 months ago

  • Assignee set to okurz
Actions #5

Updated by okurz 5 months ago

Actions #6

Updated by okurz 5 months ago

Let's try again on Tuesday next week.

Actions #7

Updated by okurz 5 months ago

Will do tomorrow, 12:15Z

Actions #8

Updated by okurz 5 months ago · Edited

  • Status changed from Workable to In Progress
  1. Why did we have the original problem? -> Because we never had a proper definition or understanding of the implications of "hostname", "dns name", "interface name", "salt nodename", etc.
  2. Why did the ticket take so long? -> We needed to understand the symptoms, and multi-machine test failures are still hard and people are afraid of them -> We assumed that the initial changes and potential fixes would actually prevent the closely related problems, and we did not proactively check whether multi-machine tests actually work in that scenario because diesel+petrol were still disabled for cross-host multi-machine tests (single-host multi-machine tests were fine) -> An unfortunate chain of problems overshadowed the original problem, which slowed us down and cost some time, but that is probably to be expected and acceptable -> plus hackweek
  3. Why did we install the wrong vanilla kernel as workaround temporarily?
    -> We just have to accept that there are situations where we need to run downgraded packages as OS bug workarounds but maybe our process for that is lacking.

    => Unless we want to use SUSE Manager, we should likely put all of that downgrade handling in salt, as we already do/did for other cases. So check what handling we currently have in salt, extend and document it, and reference it in our process documentation => #152092

  4. diesel had and still has another problem on top whereas mania+petrol are fine. Why did we struggle so much to distinguish all the different problems?
    -> Because this problem involved multiple layers including at least salt, network, openQA tests. Maybe it's enough to be aware of problematic domains like the salt hostname/node area?

    => We would again benefit from an easier reproducer. Related to #135818. Come up with a way to ping over GRE tunnels, TAP devices and openvswitch outside a SUT with differing packet sizes => #152095

    => Learn more about openvswitch by experimenting together => #152098

  5. Would diesel now work with the MTU related changes?
    -> Yes, we should test that. We should first ensure that diesel is treated as a tap worker even though it is not used as a production tap worker

    => Relax https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L38 to also match on "tap_poo1234" and document that this is how one can ensure a worker is configured for multi-machine tests but not for production jobs => #152101
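
The reproducer idea from point 4 (pinging with differing packet sizes) could look roughly like the sketch below. This is only an illustration and not what was implemented in #152095: the function names and the address 192.0.2.1 are made up, it assumes IPv4 and the Linux iputils `ping` (`-M do` forbids fragmentation, `-s` sets the ICMP payload size), and it leaves out how to pick the right bridge, TAP device or tunnel endpoint.

```python
import subprocess
from typing import Callable, Iterable, List, Optional

# 20 bytes IPv4 header + 8 bytes ICMP header (IPv4 only, no IP options)
IPV4_ICMP_OVERHEAD = 28


def build_ping_command(host: str, packet_size: int) -> List[str]:
    """Build a single non-fragmenting ping whose IP packet is `packet_size` bytes."""
    payload = packet_size - IPV4_ICMP_OVERHEAD
    if payload < 0:
        raise ValueError(f"packet size {packet_size} is too small for an ICMP echo")
    # -M do: set the DF bit, -s: payload size, -c 1: one probe, -W 2: 2 s timeout
    return ["ping", "-M", "do", "-s", str(payload), "-c", "1", "-W", "2", host]


def run_ping(command: List[str]) -> bool:
    """Run the ping command and report whether it got a reply."""
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def largest_working_size(
    host: str,
    candidates: Iterable[int],
    runner: Callable[[List[str]], bool] = run_ping,
) -> Optional[int]:
    """Return the largest candidate packet size that gets through, or None."""
    working = [
        size for size in sorted(candidates)
        if runner(build_ping_command(host, size))
    ]
    return working[-1] if working else None
```

For example, `largest_working_size("192.0.2.1", [1300, 1400, 1458, 1500])` run from one worker against a peer behind the GRE tunnel would return the largest of those sizes that still passes without fragmentation, which helps to tell an MTU problem apart from a general connectivity problem.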

Actions #9

Updated by okurz 5 months ago

  • Parent task set to #111929
Actions #10

Updated by okurz 5 months ago

  • Copied to action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M added
Actions #11

Updated by okurz 5 months ago

  • Status changed from In Progress to Resolved

Lessons learned meeting conducted. Four follow-up tickets created. Done here.
