action #139136
closed
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
Added by okurz about 1 year ago.
Updated 11 months ago.
Description
Motivation
In #136130 the work was delayed multiple times and, as of 2023-11-05, the due date had already been bumped 4 times. We should look into our execution of that ticket, learn what happened, and find improvements for the future.
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
- Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
- Subject changed from Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" to Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to High
Let's try again next Tuesday
- Status changed from Workable to In Progress
- Why did we have the original problem?
-> Because we never had a proper definition or understanding of the implications of "hostname", "DNS name", "interface name", "salt nodename", etc.
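As a starting point for pinning down those definitions, a host's various "names" can be surveyed directly; a minimal sketch (the salt-call invocation is only attempted if salt happens to be installed):

```shell
# Survey the different name sources that got confused in this ticket.
hostname                              # kernel (short) hostname
hostname -f 2>/dev/null || true       # FQDN as seen by the resolver, if any
cat /etc/hostname 2>/dev/null || true # statically configured name
# The salt "nodename" grain / minion id can differ from all of the above:
command -v salt-call >/dev/null && salt-call --local grains.get nodename || true
```

Comparing these outputs across the affected workers would make mismatches between DNS, the kernel hostname, and the salt nodename immediately visible.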
- Why did the ticket take so long?
-> We needed to understand the symptoms; multi-machine test failures are still hard to debug and people are afraid of them
-> We assumed that the initial changes and potential fixes would actually prevent the closely related problems, and we did not proactively check whether multi-machine tests actually worked in that scenario, because diesel+petrol were still disabled for cross-host multi-machine tests (single-host multi-machine tests were fine)
-> An unfortunate chain of problems overshadowed the original problem and slowed us down; that cost some time but is probably to be expected and acceptable
-> plus hackweek
- Why did we temporarily install the wrong vanilla kernel as a workaround?
-> We just have to accept that there are situations where we need to run downgraded packages as workarounds for OS bugs, but maybe our process for that is lacking.
=> Unless we want to use SUSE Manager, we should likely put all that downgrade handling in salt, as we already do/did for other cases. So check what handling we currently have in salt, extend it, and document it with a reference in our process documentation => #152092
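A sketch of what such downgrade handling in salt could look like; the package name, version, and state id below are placeholders, not the actual packages from this incident, and the real layout in salt-states-openqa may differ:

```yaml
# Sketch: pin a downgraded package as an OS-bug workaround.
# Package name and version are hypothetical examples.
kernel_downgrade_workaround:
  pkg.installed:
    - name: kernel-default
    - version: 5.14.21-150400.24.88.1  # hypothetical known-good version
    - hold: True                       # keep zypper from upgrading it again
```

Keeping such pins in one well-known state file, each with a comment referencing the bug ticket, would make temporary downgrades visible and easy to revert.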
- diesel had, and still has, another problem on top, whereas mania+petrol are fine. Why did we struggle so much to distinguish all the different problems?
-> Because this problem involved multiple layers, including at least salt, the network, and openQA tests. Maybe it's enough to be aware of problematic domains like the salt hostname/nodename area?
=> We would again benefit from an easier reproducer. Related to #135818. Come up with a way to ping over GRE tunnels, TAP devices, and openvswitch outside a SUT with differing packet sizes => #152095
=> Learn more about openvswitch by experimenting together => #152098
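The "ping with differing packet sizes" reproducer (#152095) could start from a sketch like this; the peer hostname is a placeholder and the overhead numbers assume plain GRE over IPv4:

```shell
# Sketch for probing MTU over a GRE tunnel (peer name is a placeholder).
# GRE over IPv4 adds 24 bytes of overhead (20-byte outer IP + 4-byte GRE),
# so a 1500-byte physical MTU leaves at most 1476 for the tunnel.
# An ICMP echo payload additionally needs 28 bytes (20 IP + 8 ICMP).
PHYS_MTU=1500
GRE_OVERHEAD=24
TUNNEL_MTU=$((PHYS_MTU - GRE_OVERHEAD))
MAX_PAYLOAD=$((TUNNEL_MTU - 28))
echo "largest ICMP payload expected to pass the tunnel: $MAX_PAYLOAD"
# With DF set (-M do), an oversized probe should fail loudly
# ("Message too long") instead of being silently fragmented:
# ping -M do -c 3 -s "$MAX_PAYLOAD" peer.example.com          # expect success
# ping -M do -c 3 -s "$((MAX_PAYLOAD + 1))" peer.example.com  # expect failure
```

Running the two commented-out probes from outside a SUT would distinguish an MTU/fragmentation problem in the GRE+openvswitch layer from a failure inside the test itself.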
- Would diesel now work with the MTU-related changes?
-> Yes, we should test that. We should first ensure that diesel is treated as a tap worker even though it is not used as a production tap worker
=> Relax https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L38 to also match on "tap_poo1234" and document that this is how one can ensure a worker is configured for multi-machine tests without receiving production jobs => #152101
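One possible shape for such a relaxed condition, purely illustrative; the actual expression at line 38 of openvswitch.sls and the variable it tests are not reproduced here:

```yaml
{# Sketch: match any worker class starting with "tap", e.g. "tap" or
   "tap_poo1234", so a host can be set up for multi-machine tests without
   carrying the production "tap" class. Variable name is an assumption. #}
{% if worker_classes | select('match', '^tap') | list %}
# ... openvswitch setup applied here ...
{% endif %}
```

The point of the "tap_poo1234" convention is that the ticket-suffixed class makes the worker MM-capable while the scheduler never assigns it production tap jobs.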
- Parent task set to #111929
- Copied to action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M added
- Status changed from In Progress to Resolved
Lessons learned meeting conducted. Four follow-up tickets created. Done here.