action #115112: Conduct 5 Whys for "QEMU 6.2.0 assigns all CPUs to NUMA node 0 by default" size:M - openQA Project (public) - openSUSE Project Management Tool

action #115112

closed

Conduct 5 Whys for "QEMU 6.2.0 assigns all CPUs to NUMA node 0 by default" size:M

Added by livdywan almost 3 years ago. Updated almost 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

livdywan

Category:

Organisational

Target version:

Ready

Start date:

2022-07-27

Due date:

% Done:

Estimated time:

Description

Observation¶

See #114739

Also see the follow-up fix for other architectures: os-autoinst/os-autoinst/pull/2146

Acceptance criteria¶

AC1: A Five-Whys analysis has been conducted and results documented
AC2: Improvements are planned

Suggestions¶

Bring up in retro
Conduct "Five-Whys" analysis for the topic
Identify follow-up tasks in tickets
Organize a call to conduct the 5 whys (not as part of the retro)

Updated by livdywan almost 3 years ago

Category set to Organisational
Assignee set to livdywan

Let's do this Monday 11 CET

Updated by livdywan almost 3 years ago

Description updated (diff)

Updated by livdywan almost 3 years ago

Status changed from Workable to In Progress

Q1 Why are people unhappy?
- The first pull request has 11 reviews (comments)
- -> We give quick and consistent feedback on all contributions, so everyone should feel welcome to contribute, and we ensure that contributions are treated properly
- The problem was triggered when the SUSE QE Tools team rolled out an OS upgrade of the workers, which included a new version of qemu
- QE Kernel are expected to be the domain experts who see and understand the problem first, so they are likely also the best candidates to fix it first. We know that QE Kernel are competent enough to fix this. Of course, everybody can state that they are "too busy".
- => A1-1 We should ask everybody to clearly state the impact of issues to prevent misjudgement and incorrect prioritization. Ensure our ticket templates cover that
- -> DONE Added in https://progress.opensuse.org/projects/openqav3/wiki/#Defects
- The ticket was created with priority "High" and was resolved well within 1 month (the common SLO expectation for SUSE QE teams), so no problem was observed there.
- The OS upgrade was conducted at a time when the team's capacity was already reduced, e.g. due to the A/C failure in the Nbg server rooms. The alternative would have been to wait until all A/C problems were fixed, which would likely have fallen into a more critical phase of SLE development, e.g. September/October during Alpha/Beta of SLE15SP5, which would have been a worse time. Also, minor version OS upgrades should be ok to conduct at virtually any time (also see the answers to Q4 about this)
Q2 Why did the SUSE QE Tools team not just add the trivial fix immediately?
- Of course everybody is busy with something. If somebody finds a seemingly "trivial fix" and also understands the real requirements better, then it is likely easier for them to provide the fix in a pull request for the affected code as well. Most members of SUSE QE Tools don't even know what "NUMA" is or what it is about :) In hindsight it turned out not to be a trivial fix, as it needed a patch-up anyway, so it is unlikely that SUSE QE Tools would have performed better
- => A2-1 Clarify more in our team's description that we can't be expected to be experts in everything and also that we are limited in what we can actually test
- -> DONE Added in https://progress.opensuse.org/projects/qa/wiki/Tools#Out-of-scope
Q3 Can we expect that the SUSE QE Tools team prevents or fixes any problems only observable in openQA tests even if triggered by their actions?
- The SUSE QE Tools team agrees that this commonly cannot and should not be expected. In theory we can find such cases, e.g. with openqa-investigate: when all investigation jobs fail, the problem is likely in the infrastructure. But we currently do not have that as an automatic evaluation, and we do not consider it something people should expect to be present. We could have downgraded the deployment, but users did not suggest it, and we within the SUSE QE Tools team were not made aware whether the impact of the issue was severe enough to warrant a complete downgrade affecting everybody
- => A3-1 Add a rollback consideration to the ticket templates, similar to the one for "impact"
- -> DONE Added in https://progress.opensuse.org/projects/openqav3/wiki/#Defects
- => A3-2 Add to our OS upgrade instructions a "rollback consideration" based on monitoring results and user feedback
- -> DONE added in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades
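
The openqa-investigate heuristic mentioned in the answer to Q3 could be sketched roughly as follows. This is only an illustration of the idea and not existing functionality: the function name `suspect_infrastructure` and the input format (a mapping from investigation job name to result string) are assumptions made for this example.

```python
def suspect_infrastructure(investigation_results: dict[str, str]) -> bool:
    """Return True if every investigation job failed.

    Hypothetical helper: `investigation_results` maps the name of an
    investigation job to its result string. If even the jobs using the
    last good tests and build fail, neither the tests nor the product
    are likely culprits, which points at the infrastructure.
    """
    if not investigation_results:
        # Without any investigation jobs no conclusion can be drawn.
        return False
    return all(result == "failed" for result in investigation_results.values())


# Illustrative job names and results, not taken from a real job.
results = {
    "retry": "failed",
    "last_good_tests": "failed",
    "last_good_build": "failed",
    "last_good_tests_and_build": "failed",
}
print(suspect_infrastructure(results))
```

Such a check would only give a hint. As stated above, it is not something users should expect to be present.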
Q4 Why was such a problem not discovered by package-upgrade-specific tests, e.g. within Tumbleweed or SLE/Leap 15.3 to 15.4, maybe in openQA tests?
- This was a minor version OS upgrade, which would be expected to have low impact, but qemu was actually upgraded to a new major version. Here, too, a downgrade would likely have been possible, not of the complete OS but only of the affected package(s), in this case qemu, e.g. by force-installing the 15.3 qemu version. There has to be the realistic expectation that virtually at any time some component of a complex system can change and cause problems.
- Does nobody run qemu virtual machines with a custom NUMA configuration during development of 15.4? In the past we have often caught problems in Tumbleweed first, we ran openQA on Tumbleweed, and o3 was upgraded first, yet no tester reported any problem
- => A4-1 Report a ticket about the missing validation tests to the corresponding test maintainers, Factory First wink-wink
Q5 Can we extend os-autoinst tests to discover such problems, e.g. when we update the OS base layer used within our CI?
- The line in question in the fix, https://github.com/os-autoinst/os-autoinst/pull/2140/files#diff-675dc99664e3cb2e63629d86dd587727a186671073760a12af76d2d974a692a0R883, would likely only be covered by unit tests that check that the right parameters are passed to qemu, not how qemu evaluates them. #109740 already has some plans to extend unit tests, but they would not have helped us in this specific case. We could consider more integration-level tests, but even those are unlikely to cover changes in the behaviour of qemu NUMA parameters, which is also not within the scope of os-autoinst development.
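
To illustrate the kind of unit test meant in the answer to Q5, here is a minimal sketch in Python. It is not the os-autoinst code (which is written in Perl) and does not reproduce the actual fix: `build_numa_args` and `cpus_per_node` are hypothetical helpers, and the even split of CPUs over nodes is an assumption made for this example. Only the `-numa node,nodeid=...,cpus=...` option form is taken from QEMU itself.

```python
def build_numa_args(cpus: int, nodes: int) -> list[str]:
    """Hypothetical helper: spread `cpus` vCPUs evenly over `nodes` NUMA nodes
    and return the corresponding QEMU command line arguments."""
    if nodes < 1 or cpus < nodes or cpus % nodes != 0:
        raise ValueError("cpus must be a positive multiple of nodes")
    per_node = cpus // nodes
    args = []
    for node in range(nodes):
        first = node * per_node
        last = first + per_node - 1
        cpu_range = str(first) if first == last else f"{first}-{last}"
        args += ["-numa", f"node,nodeid={node},cpus={cpu_range}"]
    return args


def cpus_per_node(args: list[str]) -> dict[int, set[int]]:
    """Parse the generated arguments back into {nodeid: set of CPU indexes}."""
    mapping = {}
    for option, value in zip(args[::2], args[1::2]):
        assert option == "-numa"
        # Drop the leading "node" keyword, then split the key=value pairs.
        fields = dict(field.split("=", 1) for field in value.split(",")[1:])
        bounds = [int(x) for x in fields["cpus"].split("-")]
        mapping[int(fields["nodeid"])] = set(range(bounds[0], bounds[-1] + 1))
    return mapping


# Unit-test style check: every CPU is assigned explicitly, exactly once,
# and not all CPUs end up on node 0.
assignment = cpus_per_node(build_numa_args(cpus=4, nodes=2))
assert assignment == {0: {0, 1}, 1: {2, 3}}
```

A test like this only verifies the arguments that are generated. It cannot detect that a new qemu version interprets the same arguments differently, which is the limitation described above.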