action #167389: Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org - openQA Project (public) - openSUSE Project Management Tool

## Motivation 
 On 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected due to work on #164898, see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA 
 This caused issues such as the incomplete job https://openqa.opensuse.org/tests/4505393 
 with 

 ``` 
 Gru job failed Reason: Error detecting remote default branch name for "git@github.com:os-autoinst/os-autoinst-needles-opensuse.git":    "ssh -oBatchMode=yes": ssh -oBatchMode=yes: command not found fatal: Could not read from remote repository.  
 ``` 

 Quoting Dimstar 
 > I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot 

 ## Acceptance criteria 
 * **AC1:** A [Five-Whys](https://en.wikipedia.org/wiki/Five_whys) analysis has been conducted and results documented 
 * **AC2:** Improvements are planned 

 ## Suggestions 
 * *DONE* Organize a call to conduct the 5 whys (not as part of the retro) 
 * Conduct "Five-Whys" analysis for the topic 
 * Identify follow-up tasks in tickets 


 ## What happened? 
 https://github.com/os-autoinst/openQA/pull/5910 was merged on 2024-09-24 but had no immediate effect because it introduced a feature switch, which is good. On the evening of 2024-09-24, at 15:17Z, tinita edited openqa.ini to add the setting `git_auto_update` but did not immediately restart the openqa-service, and found a problem which was quickly fixed by https://github.com/os-autoinst/openQA/pull/5945 . Around 2024-09-24 21:39Z a new Tumbleweed snapshot was triggered for testing, causing multiple problems. By that time openQA had already been reloaded/restarted: `sudo journalctl -u openqa-webui --since '2024-09-24'` shows that openQA was reloaded/restarted at 2024-09-24 16:39:31Z, so roughly 1h after tinita's config changes. With the setting active, the new snapshot triggered many incomplete jobs. In the meantime we have  

 ## Five Whys 
 * Q1. Why did jobs incomplete? 
   * A1. Because all jobs relied on one or multiple git GRU jobs which failed with "ssh -oBatchMode=yes: command not found" 

       * T1. A failed update of the git repository does not necessarily have to be fatal (but we still want to know about failures, at least when the periodic updates fail). No action planned. 

 * Q2. Why could this git command fail in production? 
   * A2. We relied on unit tests which mock the actual Git invocation. 

       * T2. We could mock the ssh command so that git would still be called. No action planned. 

 * Q3. Why did we not see this problem before, given that https://github.com/os-autoinst/openQA/pull/5910 never changed that ssh invocation line? 
   * A3. We already had jobs using git cloning/fetching before. Only after enabling "git_auto_update" did many more jobs use git cloning/fetching. 

       * T3. Remind ourselves to manually test code that is otherwise only covered by unit tests with mocking 

 * Q4. Why did https://openqa.opensuse.org/group_overview/24 not show any related failures? 
   * A4. Because for openQA-in-openQA tests we don't use `git@...` ssh remotes (at least not for fetch). 

       * T4. Running actual ssh within openQA-in-openQA is obviously easier than in unit tests. We would need a user ssh key though. No action planned. 

 * Q5. Why was the problem introduced in the source code? 
   * A5. The code was wrong from the start in https://github.com/os-autoinst/openQA/pull/5622/files#diff-8d9512b891ef3e8199993cfe58fe429b5f863c49a245bf03cf59315969ff46deR55 . An intermediate refactoring in https://github.com/os-autoinst/openQA/pull/5863/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R36 fixed the problem, but that PR was reverted in https://github.com/os-autoinst/openQA/pull/5880 and the original problem was then reintroduced in https://github.com/os-autoinst/openQA/pull/5900/files#diff-ae363652b8d591e88e0539738e5d7e4b79a8aeb70fd3942fc83631ef01863c19R37 . Apparently the code newly introduced in PR 5622 was mocked away and never tested with git cloning over SSH URLs. The quotes were likely reintroduced during refactoring to make tests pass, as the tests only compare against expected command strings. The quotes would only be needed if a shell parsed the command, but no shell is ever called, only the Perl system command with a list containing the "env" command call with one argument. 

     * T5. Quote the individual arguments explicitly in the test mock so that the reference strings don't look like missing quotes for pseudo-shell commands -> #164898 
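
 The quoting problem from A5 and the idea behind T5 can be illustrated with a small sketch. It is written in Python rather than Perl and is not the openQA code: the child process is a hypothetical stand-in for git that only prints the variable it received, and `shlex.split` merely approximates how a POSIX shell would split the value later (assuming, as the git documentation describes, that git hands the value of `GIT_SSH_COMMAND` to a shell).

```python
import shlex
import subprocess
import sys

VAR = "GIT_SSH_COMMAND"
# Hypothetical stand-in for git: a child that prints the variable it received.
CHILD = [sys.executable, "-c", f"import os; print(os.environ['{VAR}'])"]


def value_seen_by_child(assignment: str) -> str:
    """Run `env <assignment> <child>` as an argument list, so no shell is involved."""
    result = subprocess.run(
        ["env", assignment, *CHILD], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Without a shell nobody strips the quotes, so they become part of the value.
quoted = value_seen_by_child(f'{VAR}="ssh -oBatchMode=yes"')
unquoted = value_seen_by_child(f"{VAR}=ssh -oBatchMode=yes")

print(repr(quoted))           # '"ssh -oBatchMode=yes"'
# A shell splitting the value later sees one word, i.e. one command name.
print(shlex.split(quoted))    # ['ssh -oBatchMode=yes']
print(shlex.split(unquoted))  # ['ssh', '-oBatchMode=yes']

# T5: render recorded argument lists with per-argument quoting in test mocks.
argv = ["env", f'{VAR}="ssh -oBatchMode=yes"', "git", "fetch"]
print(" ".join(argv))    # env GIT_SSH_COMMAND="ssh -oBatchMode=yes" git fetch
print(shlex.join(argv))  # env 'GIT_SSH_COMMAND="ssh -oBatchMode=yes"' git fetch
```

 The plain join looks like a correct shell command, which is probably why the quotes seemed necessary in the expected strings. The per-argument rendering makes it visible that the quotes are literal characters inside a single argument.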


 ## Ideas 
 * I1. Fix warnings in log files 

 ``` 
 /var/log/openqa_gru.2.xz:[2024-09-19T15:47:33.908109Z] [warn] [#4329687] Local checkout at /var/lib/openqa/share/tests/opensuse has origin https://github.com/os-autoinst/os-autoinst-distri-opensuse.git but requesting to clone from https://github.com/$personal_fork/os-autoinst-distri-opensuse.git 
 ``` 
 This warning appears for multiple personal forks, but there is (only) one mention for os-autoinst-distri-openQA 

 ``` 
 /var/log/openqa_gru.1.xz:[2024-09-20T13:09:57.528757Z] [warn] [#4332550] Local checkout at /var/lib/openqa/share/tests/openqa has origin https://github.com/os-autoinst/os-autoinst-distri-openQA but requesting to clone from https://github.com/os-autoinst/os-autoinst-distri-openQA.git 
 ```
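
 In the second warning the two URLs differ only by the trailing `.git`. One possible way to avoid such a warning, shown here only as a sketch and not as how openQA compares remotes, is to normalize both URLs before comparing them. The function names and the `example-fork` URL are hypothetical.

```python
def normalize_git_url(url: str) -> str:
    """Strip trailing slashes and a trailing '.git' so equivalent URLs compare equal."""
    url = url.strip().rstrip("/")
    if url.endswith(".git"):
        url = url[: -len(".git")]
    return url


def same_remote(origin: str, requested: str) -> bool:
    """Return True if both URLs point to the same remote after normalization."""
    return normalize_git_url(origin) == normalize_git_url(requested)


origin = "https://github.com/os-autoinst/os-autoinst-distri-openQA"
requested = "https://github.com/os-autoinst/os-autoinst-distri-openQA.git"
upstream = "https://github.com/os-autoinst/os-autoinst-distri-opensuse.git"
fork = "https://github.com/example-fork/os-autoinst-distri-opensuse.git"  # hypothetical

print(same_remote(origin, requested))  # True: only the '.git' suffix differs
print(same_remote(upstream, fork))     # False: a real mismatch still gets reported
```

 Warnings for personal forks would still be emitted with this approach, so they need a separate decision.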
