action #167389

closed

QA (public) - coordination #162890: [saga][epic] feature discoverability

coordination #162896: [epic] Job triggering on jobless openQA instances

Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org

Added by okurz 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2024-09-25
Due date:
% Done:

0%

Estimated time:

Description

Motivation

On 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected due to work on #164898, see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA
This caused issues like the incomplete job https://openqa.opensuse.org/tests/4505393 with the following error:

Gru job failed Reason: Error detecting remote default branch name for "git@github.com:os-autoinst/os-autoinst-needles-opensuse.git":  "ssh -oBatchMode=yes": ssh -oBatchMode=yes: command not found fatal: Could not read from remote repository. 
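
The message above reads as if the whole string "ssh -oBatchMode=yes" was looked up as a single program name. The following is only a minimal sketch of that class of error, not a statement about what the openQA code actually did. The helper try_run and the stand-in command line are hypothetical; the Python interpreter is used in place of ssh so that the sketch runs anywhere.

```python
import shlex
import subprocess
import sys


def try_run(argv):
    """Return True if argv could be executed, False if the program was not found."""
    try:
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except FileNotFoundError:
        # The operating system found no executable with that exact name
        return False


# Hypothetical stand-in for "ssh -oBatchMode=yes": a program plus options
command_line = f"{shlex.quote(sys.executable)} -c pass"

# The whole string is passed as ONE word, so it is treated as the program name
whole_string_as_program = try_run([command_line])

# The string is split into program and options before execution
split_into_words = try_run(shlex.split(command_line))
```

In this sketch the first call fails because no executable carries the name including its options, while the second call succeeds because the program name and its options are passed as separate words.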

Quoting Dimstar

I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • DONE Organize a call to conduct the 5 whys (not as part of the retro)
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets

What happened?

https://github.com/os-autoinst/openQA/pull/5910 was merged on 2024-09-24 but had no immediate effect because of the feature switch that it introduced, which is good. On the evening of 2024-09-24, at 15:17Z, tinita edited openqa.ini with the setting git_auto_update but did not immediately restart the openQA service. tinita found a problem which was quickly fixed by https://github.com/os-autoinst/openQA/pull/5945 .

As visible in sudo journalctl -u openqa-webui --since '2024-09-24', openQA was reloaded/restarted at 2024-09-24 16:39:31Z, so roughly 1h after tinita's config changes.

Around 2024-09-24 21:39Z a new Tumbleweed snapshot was triggered for testing. By that time openQA had already been reloaded/restarted, so the changed configuration was in effect. This caused multiple problems and triggered many incomplete jobs. In the meantime we have

Five Whys


Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #167335: Conduct "lessons learned" with Five Why analysis for GRU git cloning related errors (Resolved, okurz, 2024-09-25)

Actions #1

Updated by okurz 3 months ago

  • Copied from action #167335: Conduct "lessons learned" with Five Why analysis for GRU git cloning related errors added
Actions #2

Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to Resolved