action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Organisational

Target version:

Ready

Tags:

infra

Description

Motivation¶

In #136130 the work was delayed multiple times and the due date was bumped already 4 times as of 2023-11-05. We should look into our execution of that ticket, learn what happened and find improvements for the future.

Acceptance criteria¶

AC1: A Five-Whys analysis has been conducted and results documented
AC2: Improvements are planned

Suggestions¶

Bring up in retro
Conduct "Five-Whys" analysis for the topic
Identify follow-up tasks in tickets
Organize a call to conduct the 5 whys (not as part of the retro)

Related issues 2 (0 open — 2 closed)

Updated by okurz over 1 year ago

Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added

Updated by okurz over 1 year ago

Subject changed from Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" to Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz over 1 year ago

Priority changed from Normal to High

Updated by okurz over 1 year ago

Assignee set to okurz

Updated by okurz over 1 year ago

https://suse.slack.com/archives/C02AJ1E568M/p1701087414573929

(Oliver Kurz) @channel can we do https://progress.opensuse.org/issues/139136 tomorrow after the daily infra?

Updated by okurz over 1 year ago

Let's try again next week Tuesday

Updated by okurz over 1 year ago

will do tomorrow, 1215Z

Updated by okurz over 1 year ago · Edited

Status changed from Workable to In Progress

Why did we have the original problem?
-> Because we never had a proper definition or understanding of "hostname", "DNS name", "interface name", "salt nodename", etc., or of their implications.
Why did the ticket take so long?
-> We needed to understand the symptoms, and multi-machine test failures are still hard to debug and people are afraid of them.
-> We assumed that the initial changes and potential fixes would actually prevent the closely related problems, and we did not proactively check whether multi-machine tests actually work in that scenario, because diesel+petrol were still disabled for cross-host multi-machine tests (single-host multi-machine tests were fine).
-> An unfortunate chain of problems overshadowed the original problem, which slowed us down and cost some time, but that is probably to be expected and acceptable.
-> plus hackweek
Why did we temporarily install the wrong vanilla kernel as a workaround?
-> We just have to accept that there are situations where we need to run downgraded packages as OS bug workarounds, but maybe our process for that is lacking.

=> Unless we want to use SUSE Manager, we should probably put all of that downgrade handling into salt, as we already do/did for other cases. So check what handling we currently have in salt, extend it, document it and reference it in our process documentation => #152092
diesel had, and still has, another problem on top, whereas mania+petrol are fine. Why did we struggle so much to distinguish all the different problems?
-> Because this problem involved multiple layers, including at least salt, network and openQA tests. Maybe it is enough to be aware of problematic domains like the salt hostname/node area?

=> We would again benefit from an easier reproducer. Related to #135818. Come up with a way to ping over GRE tunnels, TAP devices and openvswitch outside a SUT, with differing packet sizes => #152095

=> Learn more about openvswitch by experimenting together => #152098
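
The reproducer idea above (pinging with differing packet sizes outside a SUT) could be sketched roughly as follows. This is only a minimal sketch under assumptions, not what was implemented for #152095: it assumes IPv4, the Linux iputils `ping` with its `-M do` (forbid fragmentation) and `-s` (payload size) options, and that larger packets never pass where smaller ones fail. The function names and the target address are hypothetical.

```python
import subprocess
from typing import Callable, List

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(target: str, payload: int) -> List[str]:
    """Build a single iputils ping call that forbids fragmentation."""
    return ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), target]


def make_ping_probe(target: str) -> Callable[[int], bool]:
    """Return a probe that reports whether one ping of the given size succeeds."""
    def probe(payload: int) -> bool:
        result = subprocess.run(
            ping_command(target, payload),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


def largest_working_payload(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest payload in [low, high] that the probe accepts.

    Returns low - 1 if even the smallest payload fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid
            low = mid + 1
        else:
            high = mid - 1
    return best


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address from the documentation range.
    probe = make_ping_probe("192.0.2.10")
    print(largest_working_payload(probe, 0, icmp_payload_for_mtu(1500)))
```

Running this from a worker host against an address on the other side of the tunnel would report the largest payload that passes unfragmented; a value clearly below the size expected for the interface MTU would point at encapsulation overhead somewhere on the path.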
Would diesel now work with the MTU related changes?
-> Yes, we should test that. We should first ensure that diesel is treated as a tap worker even though it is not used as a production tap worker

=> Relax https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L38 to also match on "tap_poo1234" and document that this is how one can ensure a worker is configured for multi-machine tests but not for production jobs => #152101
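
To illustrate the kind of relaxed match meant here, the following is a small sketch in Python rather than the actual salt state. The real condition in `openvswitch.sls` is not reproduced in this ticket, so the pattern, the function name and the assumption that the check works on a comma-separated worker class string are all hypothetical.

```python
import re

# Hypothetical pattern: the plain class "tap", or "tap_" followed by a suffix
# such as a ticket reference ("tap_poo1234").
TAP_CLASS = re.compile(r"tap(_\w+)?")


def is_tap_worker(worker_class: str) -> bool:
    """Return True if any comma-separated worker class entry is a tap class."""
    classes = [entry.strip() for entry in worker_class.split(",")]
    return any(TAP_CLASS.fullmatch(entry) for entry in classes)


if __name__ == "__main__":
    for value in ("qemu_x86_64,tap", "qemu_x86_64,tap_poo1234", "qemu_x86_64"):
        print(value, is_tap_worker(value))
```

With such a match a worker carrying only "tap_poo1234" would get the multi-machine network setup, while production jobs that ask for the plain "tap" class would still not be scheduled on it.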