action #167335
closedQA (public) - coordination #162890: [saga][epic] feature discoverability
coordination #162896: [epic] Job triggering on jobless openQA instances
Conduct "lessons learned" with Five Why analysis for GRU git cloning related errors
Description
Motivation
Apparently a problem related to either #166658 or #164898 showed up in Tumbleweed openQA-in-openQA tests, reported in https://bugzilla.suse.com/show_bug.cgi?id=1230953 , but not in the devel:openQA-based openQA-in-openQA tests in https://openqa.opensuse.org/group_overview/24 . The next morning, 2024-09-25, at least Tumbleweed x86_64 and aarch64 were significantly affected, see https://suse.slack.com/archives/C02CANHLANP/p1727237231866739 and https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ZqnHKAJ3VZEn2R57ynsS7UozSKC63TK3BqtZGJZ2ERA
Quoting Dimstar
I tell you though: It's scary to wake up, look at openQA and see a 100% test fail rate on a new snapshot
Acceptance criteria
- AC1: A Five-Whys analysis has been conducted and results documented
- AC2: Improvements are planned
Suggestions
- DONE Organize a call to conduct the 5 whys (not as part of the retro)
- Conduct "Five-Whys" analysis for the topic
- Identify follow-up tasks in tickets
What happened?
https://github.com/os-autoinst/openQA/pull/5940 "Fix initial cloning via fetchneedles after 313ee7a1" was created because https://openqa.opensuse.org/tests/4500449#step/test_distribution/3 failed in openQA-in-openQA tests. So we observed the problem after the merge to master, caught by openQA-in-openQA tests based on packages in devel:openQA. The PR was merged 2024-09-23 13:18Z. https://build.opensuse.org/request/show/1202651 was created 2024-09-23 12:51Z, so before the fix.
The SR was likely created by http://jenkins.qa.suse.de/job/submit-openQA-TW-to-oS_Fctry/1092/ , triggered 2024-09-23 12:41Z, which itself was triggered by a build within http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/ , probably http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26052/ , which was triggered by http://jenkins.qa.suse.de/job/trigger-openQA_in_openQA-TW/31553/
http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26052/console shows that this build failed, referencing the failing job https://openqa.opensuse.org/tests/4500271#step/test_distribution/1 as expected.
The problem is in http://jenkins.qa.suse.de/job/monitor-openQA_in_openQA-TW/26051/ , triggered 2024-09-23 09:54Z. The commit https://github.com/os-autoinst/openQA/commit/00b92eebc35748116e937fe493e0511518800d30 introducing the regression was only created 2024-09-23 11:18Z, so those openQA-in-openQA tests ran against the old, stable packages and passed. In the meantime new package builds were triggered, so the submit job ended up submitting the faulty version of the openQA packages.
Following that, tests failed in Tumbleweed: https://openqa.opensuse.org/tests/4504430#step/openqa_bootstrap/14 failed 2024-09-24 13:32Z.
Five Whys
- Why was https://build.opensuse.org/request/show/1202651 created at a time when https://openqa.opensuse.org/tests/4500271
Ideas
- The state of the OBS repository can change after tests were triggered/monitored. When submitting, we need to make sure the OBS repository hasn't changed in the meantime to submit only what we have tested.
- This means we would not be able to submit anything if we frequently update the OBS repo. An alternative to avoid this would be to save the OBS repo upfront (e.g. make a branch) so we can later always submit the exact version we have tested.
- Maybe we can disable the build or services while we are testing in our pipelines, until we have copied into devel:openQA:tested, or even better copy a specific revision, e.g.
osc co -r "$rev" PROJECT PACKAGE
- or https://en.opensuse.org/openSUSE:Build_Service_Tips_and_Tricks , "Disable build of packages",
osc api -X POST "/source/PROJECT/PACKAGE?cmd=set_flag&flag=build&status=disable" # and later ...
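The first idea above (submit only if the OBS repository is unchanged since the tests were triggered) could be sketched roughly as follows. This is a minimal sketch, not what the Jenkins jobs currently do: the function names `current_rev` and `safe_to_submit` are hypothetical, and it assumes that `osc api /source/PROJECT/PACKAGE` returns a `<directory ... rev="N" ...>` element whose `rev` attribute identifies the source revision.

```shell
# Hypothetical helper: print the current source revision of PROJECT/PACKAGE.
# Assumption: the OBS source listing starts with <directory name=".." rev="N" ...>.
current_rev() {
    osc api "/source/$1/$2" \
        | sed -n 's/.*<directory[^>]* rev="\([^"]*\)".*/\1/p' \
        | head -n 1
}

# Hypothetical check: succeed only if the revision recorded when the tests
# were triggered is still the current one, i.e. nothing new was committed.
safe_to_submit() {
    tested_rev=$1
    now_rev=$(current_rev "$2" "$3")
    [ -n "$now_rev" ] && [ "$tested_rev" = "$now_rev" ]
}
```

A trigger job would record `rev=$(current_rev PROJECT PACKAGE)` before scheduling tests, and the submit job would call `safe_to_submit "$rev" PROJECT PACKAGE` and skip the submission when it fails. Note that this only covers source changes of a single package; rebuilds caused by dependency changes would need a separate check.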