action #139136
closed
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
Added by okurz about 1 year ago.
Updated 11 months ago.
Description
Motivation
In #136130 the work was delayed multiple times and, as of 2023-11-05, the due date had already been bumped 4 times. We should look into our execution of that ticket, learn what happened, and find improvements for the future.
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
- Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
- Subject changed from Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" to Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to High
Let's try again next Tuesday
- Status changed from Workable to In Progress
- Why did we have the original problem?
-> Because we never had a proper definition or understanding of the implications of "hostname", "DNS name", "interface name", "salt nodename", etc.
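As a starting point for pinning down those definitions, a host's various "names" can be surveyed directly; a minimal sketch (the salt-call invocation is only attempted if salt happens to be installed):

```shell
# Survey the different name sources that got confused in this ticket.
hostname                              # kernel (short) hostname
hostname -f 2>/dev/null || true       # FQDN as seen by the resolver, if any
cat /etc/hostname 2>/dev/null || true # statically configured name
# The salt "nodename" grain / minion id can differ from all of the above:
command -v salt-call >/dev/null && salt-call --local grains.get nodename || true
```

Comparing these outputs across the affected workers would make mismatches between DNS, the kernel hostname, and the salt nodename immediately visible.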
- Why did the ticket take so long?
-> We needed to understand the symptoms; multi-machine test failures are still hard to debug and people are afraid of them
-> We assumed that the initial changes and potential fixes would actually prevent the closely related problems, and we did not proactively check whether multi-machine tests actually worked in that scenario, because diesel+petrol were still disabled for cross-host multi-machine tests (single-host multi-machine tests were fine)
-> An unfortunate chain of problems overshadowed the original problem and slowed us down; that cost some time but is probably to be expected and acceptable
-> plus hackweek
- Why did we temporarily install the wrong vanilla kernel as a workaround?
-> We just have to accept that there are situations where we need to run downgraded packages as workarounds for OS bugs, but maybe our process for that is lacking.
=> Unless we want to use SUSE Manager, we should likely put all that downgrade handling in salt, as we already do/did for other cases. So check what handling we currently have in salt, extend it, and document it with a reference in our process documentation => #152092
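A sketch of what such downgrade handling in salt could look like; the package name, version, and state id below are placeholders, not the actual packages from this incident, and the real layout in salt-states-openqa may differ:

```yaml
# Sketch: pin a downgraded package as an OS-bug workaround.
# Package name and version are hypothetical examples.
kernel_downgrade_workaround:
  pkg.installed:
    - name: kernel-default
    - version: 5.14.21-150400.24.88.1  # hypothetical known-good version
    - hold: True                       # keep zypper from upgrading it again
```

Keeping such pins in one well-known state file, each with a comment referencing the bug ticket, would make temporary downgrades visible and easy to revert.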
- diesel had, and still has, another problem on top, whereas mania+petrol are fine. Why did we struggle so much to distinguish all the different problems?
-> Because this problem involved multiple layers, including at least salt, the network, and openQA tests. Maybe it's enough to be aware of problematic domains like the salt hostname/nodename area?
=> We would again benefit from an easier reproducer. Related to #135818. Come up with a way to ping over GRE tunnels, TAP devices, and openvswitch outside a SUT with differing packet sizes => #152095
=> Learn more about openvswitch by experimenting together => #152098
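The "ping with differing packet sizes" reproducer (#152095) could start from a sketch like this; the peer hostname is a placeholder and the overhead numbers assume plain GRE over IPv4:

```shell
# Sketch for probing MTU over a GRE tunnel (peer name is a placeholder).
# GRE over IPv4 adds 24 bytes of overhead (20-byte outer IP + 4-byte GRE),
# so a 1500-byte physical MTU leaves at most 1476 for the tunnel.
# An ICMP echo payload additionally needs 28 bytes (20 IP + 8 ICMP).
PHYS_MTU=1500
GRE_OVERHEAD=24
TUNNEL_MTU=$((PHYS_MTU - GRE_OVERHEAD))
MAX_PAYLOAD=$((TUNNEL_MTU - 28))
echo "largest ICMP payload expected to pass the tunnel: $MAX_PAYLOAD"
# With DF set (-M do), an oversized probe should fail loudly
# ("Message too long") instead of being silently fragmented:
# ping -M do -c 3 -s "$MAX_PAYLOAD" peer.example.com          # expect success
# ping -M do -c 3 -s "$((MAX_PAYLOAD + 1))" peer.example.com  # expect failure
```

Running the two commented-out probes from outside a SUT would distinguish an MTU/fragmentation problem in the GRE+openvswitch layer from a failure inside the test itself.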
- Would diesel now work with the MTU-related changes?
-> Yes, we should test that. We should first ensure that diesel is treated as a tap worker even though it is not used as a production tap worker
=> Relax https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L38 to also match on "tap_poo1234" and document that this is how one can ensure a worker is configured for multi-machine tests without receiving production jobs => #152101
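One possible shape for such a relaxed condition, purely illustrative; the actual expression at line 38 of openvswitch.sls and the variable it tests are not reproduced here:

```yaml
{# Sketch: match any worker class starting with "tap", e.g. "tap" or
   "tap_poo1234", so a host can be set up for multi-machine tests without
   carrying the production "tap" class. Variable name is an assumption. #}
{% if worker_classes | select('match', '^tap') | list %}
# ... openvswitch setup applied here ...
{% endif %}
```

The point of the "tap_poo1234" convention is that the ticket-suffixed class makes the worker MM-capable while the scheduler never assigns it production tap jobs.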
- Parent task set to #111929
- Copied to action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M added
- Status changed from In Progress to Resolved
Lessons learned meeting conducted. Four follow-up tickets created. Done here.