action #115112

Conduct 5 Whys for "QEMU 6.2.0 assigns all CPUs to NUMA node 0 by default" size:M

Added by cdywan about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Organisational
Target version:
Start date: 2022-07-27
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

See #114739

Also see the follow-up fix for other architectures: os-autoinst/os-autoinst/pull/2146

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Bring up in retro
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets
  • Organize a call to conduct the 5 whys (not as part of the retro)

History

#2 Updated by cdywan about 2 months ago

  • Category set to Organisational
  • Assignee set to cdywan

Let's do this on Monday at 11:00 CET

#3 Updated by cdywan about 1 month ago

  • Description updated (diff)

#4 Updated by cdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Q1 Why are people unhappy?
    • The first pull request received 11 reviews (comments)
    • -> We provide quick and consistent feedback on all contributions, so everyone should feel welcome to contribute, and we will ensure that contributions are handled properly
    • The problem was triggered when the SUSE QE Tools team started an OS upgrade of the workers, which included a new version of qemu
    • QE Kernel are expected to be the domain experts who see and understand the problem first. Are they likely also the best candidates to fix it first? We know that the QE Kernel team is competent enough to fix it. Of course, everybody states or can state that they are "too busy".
    • => A1-1 We should ask everybody to clearly state the impact of issues to prevent misjudgement and incorrect prioritization. Ensure our ticket templates cover that
    • -> DONE Added in https://progress.opensuse.org/projects/openqav3/wiki/#Defects
    • The ticket was created with priority "High" and was resolved well within 1 month (the common SLO expectation for SUSE QE teams), so no problem was observed there.
    • The OS upgrade was conducted at a time when the team's capacity was reduced anyway, e.g. due to the A/C failure in the Nbg server rooms. The alternative would have been to wait until all A/C problems were fixed, which would likely have fallen into a more critical phase of SLE development, e.g. September/October during Alpha/Beta of SLE15SP5, which would have been a worse time. Also, minor version OS upgrades should be fine to conduct at virtually any time (see also the answers to Q4)
  • Q2 Why did the SUSE QE Tools team not just add the trivial fix immediately?
    • Of course everybody is busy with something. If somebody finds a seemingly "trivial fix" and also understands the real requirements better, then it is likely easier for them to also provide the fix in a pull request for the affected code. Most members of SUSE QE Tools don't even know what "NUMA" is or what it is about :) In hindsight the fix turned out not to be trivial and needed a patch-up anyway, so it's unlikely that SUSE QE Tools would have performed better
    • => A2-1 Clarify more in our team's description that we can't be expected to be experts in everything and also that we are limited in what we can actually test
    • -> DONE Added in https://progress.opensuse.org/projects/qa/wiki/Tools#Out-of-scope
  • Q3 Can we expect that the SUSE QE Tools team prevents or fixes any problems only observable in openQA tests even if triggered by their actions?
    • The SUSE QE Tools team agrees that this commonly cannot and should not be expected. In theory we can find such cases, e.g. with openqa-investigate: when all investigation jobs fail, the problem is likely in the infrastructure. But we currently do not have that as an automatic evaluation, and we do not consider it something that people should expect to be present. We could have downgraded the deployment, but users did not suggest it, and we within the SUSE QE Tools team were not made aware whether the impact of the issue was big enough to warrant a complete downgrade affecting everybody
    • => A3-1 Add a consideration for rollback to the ticket templates, similar to the one for "impact"
    • -> DONE Added in https://progress.opensuse.org/projects/openqav3/wiki/#Defects
    • => A3-2 Add a "rollback consideration" based on monitoring results and user feedback to our OS upgrade instructions
    • -> DONE added in https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades
  • Q4 Why was such a problem not discovered by package-upgrade-specific tests, e.g. within Tumbleweed or for SLE/Leap 15.3 to 15.4, maybe in openQA tests?
    • The OS upgrade was a minor version upgrade, which would be expected to have low impact, but qemu was actually upgraded to a new major version. Here, too, it would likely have been possible to downgrade, not the complete OS but only the affected package(s), in this case qemu, e.g. by force-installing the 15.3 qemu version. We have to realistically expect that virtually at any time some component of a complex system can change and cause problems.
    • Does nobody run qemu virtual machines with a custom NUMA configuration during the development of 15.4? In the past we have often caught problems in Tumbleweed first. We also had openQA running on Tumbleweed before, and o3 was upgraded first, yet no tester reported any problem
    • => A4-1 Report a ticket about the missing validation tests to the corresponding test maintainers, Factory First wink-wink
  • Q5 Can we extend os-autoinst tests to discover such problems, e.g. when we update the OS base layer used within our CI?
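
The heuristic mentioned under Q3 (if all investigation jobs fail, the problem is likely in the infrastructure) is not evaluated automatically today. A minimal sketch of what such an evaluation could look like, assuming plain result strings and hypothetical investigation job names rather than any real openQA or openqa-investigate API:

```python
from typing import Mapping

# Assumption: a passing job is reported with this result string.
PASSED = "passed"


def classify_investigation(results: Mapping[str, str]) -> str:
    """Classify a failure from the results of its investigation jobs.

    `results` maps an investigation job name (hypothetical names here)
    to its result string.
    """
    if not results:
        # Nothing was investigated, so nothing can be concluded.
        return "unknown"
    if all(result != PASSED for result in results.values()):
        # No variant passes at all: this points away from a specific test
        # or product change and towards the infrastructure.
        return "likely infrastructure"
    # At least one variant passes, so the failure depends on what was varied.
    return "not infrastructure-specific"


# Hypothetical job names, for illustration only.
example = {"retry": "failed", "last_good_tests": "failed", "last_good_build": "failed"}
print(classify_investigation(example))  # prints "likely infrastructure"
```

Such a check could only flag candidates for a closer look; whether a rollback is warranted would still depend on the impact stated in the ticket.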

#5 Updated by openqa_review about 1 month ago

  • Due date set to 2022-08-30

Setting due date based on mean cycle time of SUSE QE Tools

#7 Updated by cdywan about 1 month ago

  • Status changed from In Progress to Feedback

Having filed #115424 and updated "Out of scope", I think all actions are covered

#8 Updated by okurz about 1 month ago

  • Due date deleted (2022-08-30)
  • Status changed from Feedback to Resolved

I agree. This should cover all
