action #175518

closed

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"

Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" size:S

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-01-24
Due date:
% Done: 0%
Estimated time:
Description

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • Organize a call to conduct the 5 whys (not as part of the retro)
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets

What happened?

  • Reported as #175464: many jobs across osd and o3 ended up incomplete with "setup failure: isotovideo can not be started". All or almost all workers were affected, though not at exactly the same time because the deployment is not instantaneous. We eventually found that a broken package had been deployed.

Five Whys

  1. Why did we not realize this before the deployments?
    • This was a patched version of the perl-Mojo-IOLoop-ReadWriteProcess package in devel:openQA.
    • This was used by openQA-in-openQA.
    • The error was visible in openQA-in-openQA.
    • o3 deployments do not require openQA-in-openQA tests to pass.
    • OSD deployment was done due to #175485, fixed meanwhile.
  2. Why was the patch not tested outside devel:openQA first?
  3. Why did the CI checks fail in https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61 ?
    • Unrelated existing failures stopped people looking at test results properly.
  4. Why did package checks in OBS not prevent the package from being published?
    • Because tests were passing. Maybe critical tests are not included?
  5. Why was it so much effort to get back to a working state?
    • The ticket was resolved within a day.
    • A team was assembled within a couple of hours of the original report.
    • Both o3 and OSD were impacted, so more than one infrastructure was affected. OSD was additionally impacted by #175485.
    • The package was deployed as part of devel:openQA, and we do not keep a package cache with the old package version on all systems. This is unlike maintenance updates, where older versions remain available in the official repositories.

What worked well

  1. W1: We formed a task force quickly and nicely collaborated relying on our processes for such situations.
  2. W2: After finding out which package update caused the problems and understanding where the previous package version came from, we could easily revert using zypper.
  3. W3: We were able to quickly link the regression to the actual development change that caused it.
  4. W4: Even though openQA-in-openQA tests couldn't be run properly because o3 was already affected, openQA-in-openQA still did not succeed as long as the regression persisted.

Ideas

  1. I1: Make o3 deployments dependent on openQA-in-openQA tests passing.
    • Re-consider staging instances?
      • Are on-demand staging instances quite similar to personal instances in practice? If so, past experience suggests little benefit.
    • For more context also see the original "continuous delivery" ticket #18006 "continuous testing + delivery of tested openQA on openSUSE"
  2. I2: Preserve package cache on worker hosts.
    • Preserving history of all repos and all packages used on production machines
  3. I3: Ensure https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/ has 100% passing unit tests in CI
  4. I4: Explore tooling for checking of deployed versions and bisecting of problematic packages
    • git across multiple repos can tell us what is being used
    • salt can tell us what's installed on osd machines
      • Enhance the existing diff logic used for deployments and use it e.g. with investigations
    • How can deployed packages on o3 be verified?
    • Consider using snapshots on all machines
    • https://wiki.archlinux.org/title/Etckeeper
    • Use containers to ensure we know exactly what is deployed everywhere?
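
As a rough illustration of idea I4, the following minimal sketch compares two snapshots of installed packages and reports what changed between them. It assumes each snapshot is a list of "name version-release" lines, for example as produced by rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n'. The function names parse_package_list and diff_packages and all version strings in the demo are hypothetical and not part of any existing openQA or salt tooling.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_package_list(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'name version-release' lines into a {name: version} mapping.

    Assumption: one package per line, name and version separated by whitespace.
    If a name occurs more than once (e.g. multiple kernels), the last one wins.
    """
    packages: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        name, version = line.split(None, 1)
        packages[name] = version
    return packages


def diff_packages(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (old, new)} for every package that differs.

    None marks a package that is absent in the respective snapshot.
    """
    changed: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


if __name__ == "__main__":
    # Hypothetical snapshots; the version strings are made up for illustration.
    last_good = parse_package_list([
        "perl-Mojo-IOLoop-ReadWriteProcess 1.0-1.1",
        "openQA-worker 5.1-1.1",
    ])
    current = parse_package_list([
        "perl-Mojo-IOLoop-ReadWriteProcess 1.0-2.1",
        "openQA-worker 5.1-1.1",
    ])
    for name, (old, new) in diff_packages(last_good, current).items():
        print(f"{name}: {old} -> {new}")
```

The packages reported by such a diff would be the candidates to bisect or revert. Where the snapshots come from (salt, a package cache, file system snapshots) is left open here, as in the ideas above.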

Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" (Resolved, okurz, 2025-01-15)

Actions #1

Updated by okurz about 1 month ago

  • Copied from action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Actions #2

Updated by mkittler about 1 month ago

  • Subject changed from Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" to Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" size:S
  • Status changed from New to Workable
Actions #3

Updated by livdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

Let's discuss the topic this afternoon

Actions #4

Updated by livdywan about 1 month ago

  • Description updated (diff)
Actions #5

Updated by openqa_review about 1 month ago

  • Due date set to 2025-02-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by livdywan about 1 month ago

  • Subtask #176139 added
Actions #7

Updated by livdywan about 1 month ago

  • Subtask #176142 added
Actions #8

Updated by livdywan about 1 month ago

  • Subtask #176145 added
Actions #9

Updated by livdywan about 1 month ago

  • Subtask #176148 added
Actions #10

Updated by livdywan about 1 month ago

  • Subtask #176151 added
Actions #11

Updated by livdywan about 1 month ago

  • Status changed from In Progress to Feedback
Actions #12

Updated by livdywan about 1 month ago

  • Subtask deleted (#176139)
Actions #13

Updated by livdywan about 1 month ago

  • Subtask deleted (#176142)
Actions #14

Updated by livdywan about 1 month ago

  • Subtask deleted (#176145)
Actions #15

Updated by livdywan about 1 month ago

  • Subtask deleted (#176148)
Actions #16

Updated by livdywan about 1 month ago

  • Subtask deleted (#176151)
Actions #17

Updated by livdywan about 1 month ago

  • Status changed from Feedback to Resolved

Follow-up tickets for the ideas filed
