action #139136
coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
0%
Description
Motivation
In #136130 the work was delayed multiple times and the due date had already been bumped 4 times as of 2023-11-05. We should review how we executed that ticket, understand what happened and find improvements for the future.
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- Bring up in retro
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
- Organize a call to conduct the 5 whys (not as part of the retro)
Updated by okurz about 1 year ago
- Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Updated by okurz almost 1 year ago
- Subject changed from Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" to Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 12 months ago
https://suse.slack.com/archives/C02AJ1E568M/p1701087414573929
(Oliver Kurz) @channel can we do https://progress.opensuse.org/issues/139136 tomorrow after the daily infra?
Updated by okurz 11 months ago · Edited
- Status changed from Workable to In Progress
- Why did we have the original problem? -> Because we never had a proper definition or understanding of the implications of "hostname", "dns name", "interface name", "salt nodename", etc.
- Why did the ticket take so long? -> We needed to understand the symptoms, and multi-machine test failures are still hard and people are afraid of them -> We assumed that the initial changes and potential fixes would actually prevent the closely related problems, and we did not proactively check whether multi-machine tests actually work in that scenario because diesel+petrol were still disabled for cross-host multi-machine tests (single-host multi-machine tests were fine) -> An unfortunate chain of problems overshadowed the original problem, which slowed us down and cost some time, but that is probably to be expected and acceptable -> plus hackweek
Why did we install the wrong vanilla kernel as workaround temporarily?
-> We just have to accept that there are situations where we need to run downgraded packages as OS bug workarounds, but maybe our process for that is lacking. => Unless we want to use SUSE Manager, we should likely put all that downgrade handling in salt as we already do/did for other cases. So check what handling we currently have in salt, extend and document it, and reference it in our process documentation => #152092
diesel had and still has another problem on top whereas mania+petrol are fine. Why did we struggle so much to distinguish all the different problems?
-> Because this problem involved multiple layers including at least salt, network and openQA tests. Maybe it's enough to be aware of problematic domains like the salt hostname/node area? => We would again benefit from an easier reproducer. Related to #135818. Come up with a way to ping over GRE tunnels, TAP devices and openvswitch outside a SUT with differing packet sizes => #152095
=> Learn more about openvswitch by experimenting together => #152098
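To make the "ping with differing packet sizes" idea a bit more concrete, here is a minimal Python sketch that only builds the ping command lines for a set of candidate MTU values; it does not run them. It assumes IPv4 without options and the Linux iputils `ping` (`-M do` to forbid fragmentation, `-s` for the payload size, `-c` for the count). The function names, the MTU list and the target address `192.0.2.10` are hypothetical placeholders and not taken from #152095.

```python
from __future__ import annotations

import shlex

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(target: str, mtu: int, count: int = 1) -> list[str]:
    """Build an iputils ping call that must not be fragmented on the path."""
    return [
        "ping",
        "-M", "do",                         # set "Don't Fragment"
        "-c", str(count),                   # number of echo requests
        "-s", str(payload_for_mtu(mtu)),    # ICMP payload size in bytes
        target,
    ]


def sweep(target: str, mtus: list[int]) -> list[str]:
    """Return one shell-quoted command line per candidate MTU."""
    return [shlex.join(ping_command(target, mtu)) for mtu in mtus]


if __name__ == "__main__":
    # Placeholder target and MTU candidates, adjust to the network under test.
    for line in sweep("192.0.2.10", [1500, 1460, 1400, 1350]):
        print(line)
```

Running the printed commands from a worker host towards an address behind the tunnel would show at which size packets stop passing, assuming the path does not silently fragment.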
Would diesel now work with the MTU related changes?
-> Yes, we should test that. We should first ensure that diesel is treated as a tap worker even though it is not used as a production tap worker. => Relax https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls?ref_type=heads#L38 to also match on "tap_poo1234" and document that this is how one can ensure a worker is configured for multi-machine tests but not for production jobs => #152101
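The intended relaxed match could behave roughly like the following Python sketch. This is only an illustration of the matching rule: the real condition lives in the salt state linked above, whose current content is not reproduced here, and the helper name `is_tap_worker` as well as the assumption that worker classes come as a comma-separated string are hypothetical.

```python
def is_tap_worker(worker_class: str) -> bool:
    """Return True if any worker class is "tap" or starts with "tap_".

    Assumes a comma-separated class string; this mirrors the idea of also
    accepting e.g. "tap_poo1234", not the actual salt state.
    """
    classes = [c.strip() for c in worker_class.split(",")]
    return any(c == "tap" or c.startswith("tap_") for c in classes)


if __name__ == "__main__":
    for example in ("qemu_x86_64,tap", "qemu_x86_64,tap_poo1234", "qemu_x86_64"):
        print(example, is_tap_worker(example))
```

With such a rule a worker carrying only "tap_poo1234" would still get the multi-machine network setup, while production jobs that request plain "tap" would not be scheduled on it.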
Updated by okurz 11 months ago
- Copied to action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M added