action #167335

closed

QA (public) - coordination #162890: [saga][epic] feature discoverability

coordination #162896: [epic] Job triggering on jobless openQA instances

Conduct "lessons learned" with Five Why analysis for GRU git cloning related errors

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2024-09-25
Due date:
% Done: 0%
Estimated time:
Description

Motivation

Apparently a problem related to either #166658 or #164898 showed up in Tumbleweed openQA-in-openQA tests, as reported in https://bugzilla.suse.com/show_bug.cgi?id=1230953 , but not in the devel:openQA-based openQA-in-openQA tests in https://openqa.opensuse.org/group_overview/24 . The next morning, 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected; see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA

Quoting Dimstar:

I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot

Acceptance criteria

  • AC1: A Five-Whys analysis has been conducted and results documented
  • AC2: Improvements are planned

Suggestions

  • DONE Organize a call to conduct the 5 whys (not as part of the retro)
  • Conduct "Five-Whys" analysis for the topic
  • Identify follow-up tasks in tickets

What happened?

https://github.com/os-autoinst/openQA/pull/5940 "Fix initial cloning via fetchneedles after 313ee7a1" was created because https://openqa.opensuse.org/tests/4500449#step/test_distribution/3 failed in openQA-in-openQA tests. So we observed the problem after the merge to master, caught by openQA-in-openQA tests based on packages in devel:openQA. The PR was merged 2024-09-23 13:18Z. https://build.opensuse.org/request/show/1202651 was created 2024-09-23 12:51Z, so before the fix was merged.

The SR was likely created by http://jenkins.qa.suse.de/job/submit-openQA-TW-to-oS_Fctry/1092/ , triggered 2024-09-23 12:41Z, which itself was triggered by a build within http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ , probably http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26052/ , which was triggered by http://jenkins.qa.suse.de/job/trigger-openQA_in_openQA-TW/31553/ . http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26052/console shows that this build failed, referencing the failing job https://openqa.opensuse.org/tests/4500271#step/test_distribution/1 , as expected.

The problem is in http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26051/ , triggered 2024-09-23 09:54Z, whereas https://github.com/os-autoinst/openQA/commit/00b92eebc35748116e937fe493e0511518800d30 , which introduced the regression, was created 2024-09-23 11:18Z. So the openQA-in-openQA tests started based on the old, stable packages and ended up successful. In the meantime, however, new package builds were triggered, so the submit job ended up submitting the faulty version of the openQA packages.

Following that, tests failed in Tumbleweed: https://openqa.opensuse.org/tests/4504430#step/openqa_bootstrap/14 failed 2024-09-24 13:32Z.
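
The timing gap described above can be checked with a few lines of Python. This is only an illustrative sketch: the timestamps are the ones quoted in this description, the helper names are made up for the example, and the commit time is used as a rough stand-in for the moment the packages in devel:openQA changed (the actual package rebuild time is not recorded here).

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps as quoted in the description above.
tests_triggered = utc("2024-09-23 09:54")       # monitor build 26051
regression_committed = utc("2024-09-23 11:18")  # commit 00b92ee
submit_triggered = utc("2024-09-23 12:41")      # submit job 1092
fix_merged = utc("2024-09-23 13:18")            # PR 5940


def untested_change_submitted(tested_at, changed_at, submitted_at):
    """True if a change landed after testing started but before submission."""
    return tested_at < changed_at < submitted_at


regression_slipped_through = untested_change_submitted(
    tests_triggered, regression_committed, submit_triggered
)
fix_included = fix_merged <= submit_triggered

print("regression submitted untested:", regression_slipped_through)
print("fix part of the submission:", fix_included)
```

With these values the regression falls inside the window between test start and submission, while the fix was merged only after the submit job had already been triggered.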

Five Whys

  1. Why was https://build.opensuse.org/request/show/1202651 created at a time when https://openqa.opensuse.org/tests/4500271

Ideas

  • The state of the OBS repository can change after tests were triggered/monitored. When submitting, we need to make sure the OBS repository hasn't changed in the meantime, so that we submit only what we have tested.
    • This means we would not be able to submit anything if we frequently update the OBS repo. An alternative that avoids this would be to save the state of the OBS repo upfront (e.g. make a branch) so we can later always submit the exact version we have tested.
    • Maybe we can disable the build or services while we are testing in our pipelines until we have copied into devel:openQA:tested, or even better, copy a specific revision, e.g. osc co -r $rev
    • Alternatively, see https://en.opensuse.org/openSUSE:Build_Service_Tips_and_Tricks , "Disable build of packages": osc api -X POST "/source/PROJECT/PACKAGE?cmd=set_flag&flag=build&status=disable" # and later ...
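
The first idea above (submit only if the repository is unchanged since testing) could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: get_revision, run_tests and submit are hypothetical callables that a real job would have to implement, for example on top of osc, and RepositoryChangedError is a name invented for this example.

```python
from typing import Callable, Optional


class RepositoryChangedError(RuntimeError):
    """Raised when the repository moved on while tests were running."""


def submit_if_unchanged(
    get_revision: Callable[[], str],
    run_tests: Callable[[str], bool],
    submit: Callable[[str], None],
) -> Optional[str]:
    """Submit the tested revision only if it is still the current one.

    All three callables are hypothetical hooks for this sketch.
    Returns the submitted revision, or None if the tests failed.
    """
    tested_rev = get_revision()      # remember what we are about to test
    if not run_tests(tested_rev):
        return None                  # never submit a failing revision
    current_rev = get_revision()     # look again right before submitting
    if current_rev != tested_rev:
        raise RepositoryChangedError(
            f"tested {tested_rev} but repository is now at {current_rev}"
        )
    submit(tested_rev)               # submit exactly the tested revision
    return tested_rev


if __name__ == "__main__":
    # Simulate the incident: the revision changes while tests are running.
    revisions = iter(["r41", "r42"])
    try:
        submit_if_unchanged(lambda: next(revisions), lambda rev: True, print)
    except RepositoryChangedError as error:
        print("submission refused:", error)
```

As noted in the sub-points, a plain equality check like this would block submissions whenever the repository is updated frequently, which is why pinning a branch or a specific revision upfront may be the more practical variant.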

Related issues 3 (0 open, 3 closed)

  • Copied from openQA Project (public) - action #166658: Trigger os-autoinst-distri-example tests from fresh openQA instances via a button on the index page size:S (Resolved, mkittler, 2024-10-02)
  • Copied to openQA Project (public) - action #167389: Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org (Resolved, okurz, 2024-09-25)
  • Copied to openQA Project (public) - action #167395: Ensure only the tested revision of devel:openQA packages are submitted to openSUSE:Factory size:M (Resolved, mkittler, 2024-09-25, 2024-12-12)

#1

Updated by okurz 3 months ago

  • Copied from action #166658: Trigger os-autoinst-distri-example tests from fresh openQA instances via a button on the index page size:S added

#2

Updated by okurz 3 months ago

  • Description updated (diff)

#3

Updated by okurz 3 months ago

  • Copied to action #167389: Conduct "lessons learned" with Five Why analysis for 2024-09-25 GRU git errors on openqa.opensuse.org added

#4

Updated by okurz 3 months ago

  • Copied to action #167395: Ensure only the tested revision of devel:openQA packages are submitted to openSUSE:Factory size:M added

#5

Updated by okurz 3 months ago

  • Status changed from New to Resolved

I created the important follow-up #167395
