action #167389: Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org - openQA Project (public) - openSUSE Project Management Tool


action #167389

closed

QA (public) - coordination #162890: [saga][epic] feature discoverability

coordination #162896: [epic] Job triggering on jobless openQA instances

Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org

Added by okurz 8 months ago. Updated 8 months ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Feature requests

Target version:

Ready

Start date:

2024-09-25

Due date:

% Done:

Estimated time:

Tags:

reactive work, lesson learned

Description

Motivation

On 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected due to work on #164898, see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA
This caused incomplete jobs such as https://openqa.opensuse.org/tests/4505393 with the following reason:

Gru job failed Reason: Error detecting remote default branch name for "git@github.com:os-autoinst/os-autoinst-needles-opensuse.git":  "ssh -oBatchMode=yes": ssh -oBatchMode=yes: command not found fatal: Could not read from remote repository.

Quoting Dimstar

I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot

Acceptance criteria

AC1: A Five-Whys analysis has been conducted and results documented
AC2: Improvements are planned

Suggestions

DONE Organize a call to conduct the 5 whys (not as part of the retro)
Conduct "Five-Whys" analysis for the topic
Identify follow-up tasks in tickets

What happened?

https://github.com/os-autoinst/openQA/pull/5910 was merged on 2024-09-24 but had no immediate effect because of the feature switch that was introduced with it, which is good. On 2024-09-24 at 15:17Z tinita edited openqa.ini to enable the setting git_auto_update but did not immediately restart the openqa-service, and found a problem which was quickly fixed by https://github.com/os-autoinst/openQA/pull/5945 . As visible in sudo journalctl -u openqa-webui --since '2024-09-24', openQA was reloaded/restarted at 2024-09-24 16:39:31Z, so roughly 1h after tinita's config changes. At around 2024-09-24 21:39Z a new Tumbleweed snapshot was triggered for testing. By then openQA had already been reloaded/restarted with the new setting, so the snapshot ran into multiple problems and produced many incomplete jobs. In the meantime we have

Five Whys

Q1. Why did jobs end up incomplete?
- A1. Because all jobs relied on one or multiple git GRU jobs which failed with "ssh -oBatchMode=yes: command not found".
  - T1. A failure to update the git repository does not necessarily have to be fatal (but we still want to know about failures, at least when the periodic updates fail). No action planned.
Q2. Why could this git command fail in production?
- A2. We relied on unit tests which mock the actual Git invocation.
  - T2. We could mock only the ssh command so that git would still be called. No action planned.
Q3. Why did we not see this problem earlier, given that https://github.com/os-autoinst/openQA/pull/5910 never changed that ssh invocation line?
- A3. Some jobs already used git cloning/fetching before. Only after enabling "git_auto_update" did many more jobs use git cloning/fetching.
  - T3. Remind ourselves to manually test code that is otherwise only covered by unit tests with mocking.
Q4. Why did https://openqa.opensuse.org/group_overview/24 not show any related failures?
- A4. Because for openQA-in-openQA tests we don't use git@... ssh remotes (at least not for fetch).
  - T4. Running actual ssh within openQA-in-openQA is obviously easier than in unit tests. We would need a user ssh key though. No action planned.
Q5. Why was the problem introduced in the source code?
- A5. The code was wrong from the start in https://github.com/os-autoinst/openQA/pull/5622/files#diff-8d9512b891ef3e8199993cfe58fe429b5f863c49a245bf03cf59315969ff46deR55 .
  An intermediate refactoring in https://github.com/os-autoinst/openQA/pull/5863/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R36 fixed the problem, but that PR was reverted in https://github.com/os-autoinst/openQA/pull/5880 , and the original problem was then reintroduced in https://github.com/os-autoinst/openQA/pull/5900/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R37 . Apparently the code newly introduced in PR 5622 was mocked away and never tested with git cloning on SSH URLs. The quotes were likely reintroduced during refactoring to make tests pass, as the tests only compare against expected command strings. The quotes would only be needed by a shell, but no shell is ever called, only the Perl system command with a list containing the "env" command call with one argument.
  - T5. Quote the individual arguments explicitly in the test mock so that the reference strings don't look like pseudo-shell commands with missing quotes -> #164898
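
The effect described in A5 and the idea behind T5 can be reproduced outside of openQA. The following is only a minimal sketch in Python, not the openQA code (which is Perl): the helper name value_seen_by_child and the "git fetch" argument list are hypothetical and chosen for illustration. It assumes a POSIX "env" binary is available. shlex.split is used as an approximation of how a shell would split the GIT_SSH_COMMAND value into words.

```python
import shlex
import subprocess
import sys

# Child process that reports the GIT_SSH_COMMAND value it actually received.
PRINT_VAR = [sys.executable, "-c",
             "import os; print(os.environ['GIT_SSH_COMMAND'], end='')"]


def value_seen_by_child(assignment: str) -> str:
    """Run `env ASSIGNMENT <child>` as an argument list, so no shell is involved.

    Hypothetical helper for this illustration only.
    """
    result = subprocess.run(["env", assignment, *PRINT_VAR],
                            capture_output=True, text=True, check=True)
    return result.stdout


# With quotes: nothing strips them, so they end up inside the value.
quoted = value_seen_by_child('GIT_SSH_COMMAND="ssh -oBatchMode=yes"')
# Without quotes: the list element is already one single argument.
unquoted = value_seen_by_child("GIT_SSH_COMMAND=ssh -oBatchMode=yes")

print(repr(quoted))
print(repr(unquoted))

# Approximate the shell word splitting applied to the value later on.
print(shlex.split(quoted))    # a single word, treated as one command name
print(shlex.split(unquoted))  # the command "ssh" plus one option

# T5 idea: render reference strings with per-argument quoting so that the
# argument boundaries are visible (hypothetical argument list).
args = ["env", "GIT_SSH_COMMAND=ssh -oBatchMode=yes", "git", "fetch"]
print(shlex.join(args))
```

With the literal quotes in place the value splits into the single word "ssh -oBatchMode=yes", which matches the "command not found" message in the failed GRU jobs. A reference string rendered with per-argument quoting shows the quotes as argument delimiters only, so it no longer suggests that quotes are missing from the actual argument.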

Related issues 1 (0 open — 1 closed)


Updated by okurz 8 months ago

Copied from action #167335: Conduct "lessons learned" with Five Why analysis for GRU git cloning related errors added


Updated by okurz 8 months ago

Description updated (diff)
Status changed from New to Resolved
