action #167389
closed
QA (public) - coordination #162890: [saga][epic] feature discoverability
coordination #162896: [epic] Job triggering on jobless openQA instances
Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org
Description
Motivation
On 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected, see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA . The cause was work on #164898, which led to incomplete jobs such as https://openqa.opensuse.org/tests/4505393 with the error:
Gru job failed Reason: Error detecting remote default branch name for "git@github.com:os-autoinst/os-autoinst-needles-opensuse.git": "ssh -oBatchMode=yes": ssh -oBatchMode=yes: command not found fatal: Could not read from remote repository.
Quoting Dimstar
I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- DONE Organize a call to conduct the 5 whys (not as part of the retro)
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
What happened?
https://github.com/os-autoinst/openQA/pull/5910 was merged on 2024-09-24 but had no immediate effect because it introduced a feature switch, which is good. On 2024-09-24 at 15:17Z tinita edited openqa.ini to enable the setting git_auto_update but did not immediately restart the openQA service. tinita also found a problem, which was quickly fixed by https://github.com/os-autoinst/openQA/pull/5945 . As visible in the output of sudo journalctl -u openqa-webui --since '2024-09-24', openQA was reloaded/restarted at 2024-09-24 16:39:31Z, so roughly 1h after tinita's config changes. Around 2024-09-24 21:39Z a new Tumbleweed snapshot was triggered for testing. By then openQA had already been reloaded/restarted, so the new setting was active, and this triggered many incomplete jobs.
In the meantime we have
Five Whys
Q1. Why did jobs incomplete?
- A1. Because all jobs relied on one or multiple git GRU jobs which failed with "ssh -oBatchMode=yes: command not found"
- T1. Updating the git repository does not necessarily have to be fatal (but we still want to know about failures, at least when the periodic updates fail). No action planned.
Q2. Why could this git command fail in production?
- A2. We relied on unit tests which mock the actual Git invocation.
- T2. We could mock the ssh command so that git would still be called. No action planned.
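The idea in T2, stubbing only ssh while the calling command still runs for real, can be sketched as follows. This is a minimal Python illustration and not openQA's Perl test code. The helper name `run_with_fake_ssh`, the stub script and the `FAKE_SSH_LOG` variable are all hypothetical, and `sh -c` stands in for git, which also hands the value of GIT_SSH_COMMAND to a shell. The sketch assumes a POSIX system with `sh` available.

```python
import os
import stat
import subprocess
import tempfile


def run_with_fake_ssh(ssh_command_line):
    """Run a command line through `sh -c` with a stub `ssh` first on PATH.

    Returns (exit_code, recorded_args). `sh -c` is a stand-in for git here;
    the stub never opens a network connection.
    """
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "ssh.log")
        fake_ssh = os.path.join(tmp, "ssh")
        # Hypothetical stub: records one argument per line and exits 0.
        with open(fake_ssh, "w") as fh:
            fh.write('#!/bin/sh\nprintf "%s\\n" "$@" >> "$FAKE_SSH_LOG"\n')
        os.chmod(fake_ssh, stat.S_IRWXU)

        env = dict(os.environ)
        env["PATH"] = tmp + os.pathsep + env.get("PATH", "")
        env["FAKE_SSH_LOG"] = log_path

        result = subprocess.run(["sh", "-c", ssh_command_line], env=env,
                                capture_output=True, text=True)
        recorded = []
        if os.path.exists(log_path):
            with open(log_path) as fh:
                recorded = fh.read().splitlines()
        return result.returncode, recorded


# Correct value: the shell finds the stub and passes the option to it.
ok_code, ok_args = run_with_fake_ssh("ssh -oBatchMode=yes example.invalid")

# Value with literal quotes: the shell looks for a program whose name is
# the whole quoted string, so the stub is never reached.
bad_code, bad_args = run_with_fake_ssh('"ssh -oBatchMode=yes" example.invalid')
```

With a stub at this level the quoted variant fails with exit code 127 ("command not found"), which is the kind of failure seen in production, whereas a mock that replaces the whole Git invocation cannot notice it.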
Q3. Why did we not see this problem before, given that https://github.com/os-autoinst/openQA/pull/5910 never changed that ssh invocation line?
- A3. We already had jobs using git cloning/fetching before. Only after enabling "git_auto_update" did many more jobs use git cloning/fetching.
- T3. Remind ourselves to manually test code that is only covered by unit tests relying on mocks.
Q4. Why did https://openqa.opensuse.org/group_overview/24 not show any related failures?
- A4. Because for openQA-in-openQA tests we don't use git@... ssh remotes (at least not for fetch).
- T4. Running actual ssh within openQA-in-openQA is obviously easier than in unit tests. We would need a user ssh key though. No action planned.
Q5. Why was the problem introduced in the source code?
- A5. The code was wrong from the start in https://github.com/os-autoinst/openQA/pull/5622/files#diff-8d9512b891ef3e8199993cfe58fe429b5f863c49a245bf03cf59315969ff46deR55 . An intermediate refactoring in https://github.com/os-autoinst/openQA/pull/5863/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R36 fixed the problem, but that PR was reverted in https://github.com/os-autoinst/openQA/pull/5880 and the original problem was then reintroduced in https://github.com/os-autoinst/openQA/pull/5900/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R37 . Apparently the code newly introduced in PR 5622 was mocked away and never tested with git cloning over SSH URLs. The quotes were likely reintroduced during refactoring to make tests pass, as the tests only compare against expected command strings. The quotes would only be needed by a shell, but no shell is ever called: only the Perl system command is called, with a list containing the "env" command call with one argument, so the quotes are passed on literally.
- T5. Quote the individual arguments explicitly in the test mock so that the reference strings don't look like missing quotes for pseudo-shell commands -> #164898
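The effect described in A5 and the idea behind T5 can be illustrated with a small sketch. It uses Python's `subprocess.run` with an argument list as an analogy to Perl's list-form system call, since neither involves a shell. It is not the openQA code. The helper `value_seen_by_child` and the `git fetch` arguments are hypothetical, and a POSIX `env` command is assumed to be available.

```python
import shlex
import subprocess
import sys

# Child process that prints the variable exactly as it arrives.
PRINT_VAR = [sys.executable, "-c",
             "import os; print(os.environ['GIT_SSH_COMMAND'])"]


def value_seen_by_child(assignment):
    """Run `env ASSIGNMENT child` as an argument list, without any shell."""
    result = subprocess.run(["env", assignment] + PRINT_VAR,
                            capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")


# Quotes inside a list element are not shell syntax, they become part of
# the value: the child sees "ssh -oBatchMode=yes" including the quotes.
quoted = value_seen_by_child('GIT_SSH_COMMAND="ssh -oBatchMode=yes"')

# Without the quotes the child sees: ssh -oBatchMode=yes
plain = value_seen_by_child("GIT_SSH_COMMAND=ssh -oBatchMode=yes")

# T5 idea: build the reference string for test comparisons by quoting each
# list element individually, so it neither looks like a quote is missing
# nor tempts anyone to add one to the real argument.
argv = ["env", "GIT_SSH_COMMAND=ssh -oBatchMode=yes", "git", "fetch"]
reference = shlex.join(argv)
# reference is: env 'GIT_SSH_COMMAND=ssh -oBatchMode=yes' git fetch
```

Because `shlex.split(reference)` returns the original list again, a reference string rendered this way stays unambiguous about where each argument starts and ends. A Perl test mock would need an equivalent per-argument quoting helper.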