action #108665
openqa_from_containers repeatedly failing at build
Description
openqa_from_containers has repeatedly failed at build. https://openqa.opensuse.org/tests/2256377 is presumably the first occurrence:
# Test died: command 'for i in {1..3}; do docker build openQA/container/worker -t openqa_worker && break; done' timed out at openqa//tests/containers/build.pm line 9.
History
#2
Updated by okurz 3 months ago
A quick look at https://openqa.opensuse.org/tests/2256377#next_previous shows 7/100 test failures in "build", all on 2022-03-21, and no failures in "build" since then. Each case showed a different specific error, but all of them occurred within zypper calls. Within os-autoinst-distri-opensuse, zypper is already mostly called with retries, so we should likely do the equivalent here. Sadly there is no movement on https://github.com/openSUSE/zypper/issues/420 to do that within zypper itself. I created https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/81 for a trivial timeout bump. This accounts for 3/10 job failures. The other 7 all fail in some zypper-internal step, so we need an actual retry, potentially parsing the error message so that we retry only on those errors and not on failures in general.
#3
Updated by okurz 3 months ago
I am thinking of a more generic retry wrapper. For this I am continuing with https://github.com/okurz/scripts/blob/master/retry, which I slightly optimized in https://github.com/okurz/scripts/commit/aacb70e8281e2783877e18286b2d4cafe5f60ec5 and for which I added tests in https://github.com/okurz/scripts/commit/2a4d102305eec3ec93b831a88d421b35fe14c276, https://github.com/okurz/scripts/commit/8048a66b662dcf15c977fc63d5d2f54e3b45c34b and https://github.com/okurz/scripts/commit/0509f3f1e4084e68d7f5a5d3460b532ba83be0d3 before I go about extending it. I am thinking of extending the retry script with a passlist and a blocklist of regex matches on the command output so that, if specified, it would retry only if a passlist entry matches the command output, or retry unless a blocklist entry matches the command output. https://unix.stackexchange.com/a/499091 looks promising.
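To make the passlist/blocklist idea concrete, here is a minimal sketch in Python. It is not taken from the linked retry script; the names `run_shell`, `should_retry` and `retry` are hypothetical. The rule that a blocklist match takes precedence over a passlist match is an assumption, as the comment above does not say how the two lists would combine.

```python
import re
import subprocess
import time
from typing import Callable, Sequence, Tuple

Result = Tuple[int, str]  # (exit code, combined stdout/stderr)


def run_shell(command: str) -> Result:
    """Run a shell command and return its exit code and combined output."""
    proc = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def should_retry(output: str, passlist: Sequence[str] = (),
                 blocklist: Sequence[str] = ()) -> bool:
    """Decide whether a failed attempt is worth repeating."""
    # Assumption: a blocklist match always forbids a retry.
    if any(re.search(pattern, output) for pattern in blocklist):
        return False
    # With a passlist, retry only if one of its entries matches.
    if passlist:
        return any(re.search(pattern, output) for pattern in passlist)
    # Without a passlist, every failure not blocked above is retried.
    return True


def retry(run: Callable[[], Result], attempts: int = 3,
          passlist: Sequence[str] = (), blocklist: Sequence[str] = (),
          delay: float = 0.0) -> Result:
    """Call run() up to `attempts` times and return the last result."""
    code, output = run()
    for _ in range(attempts - 1):
        if code == 0 or not should_retry(output, passlist, blocklist):
            break
        time.sleep(delay)
        code, output = run()
    return code, output
```

A call such as `retry(lambda: run_shell("docker build openQA/container/worker -t openqa_worker"), passlist=[r"Timeout exceeded"])` would then repeat the build only when the output matches the given pattern. The pattern here is a placeholder, not an actual zypper error message from the failed jobs.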
#4
Updated by openqa_review 3 months ago
- Due date set to 2022-04-07
Setting due date based on mean cycle time of SUSE QE Tools
#5
Updated by okurz 3 months ago
- Due date deleted (2022-04-07)
- Status changed from In Progress to Resolved
Looking at the video, I observed that the first of up to three attempts was already aborted internally, with an overall timeout of 1h.
I added retry on the job level:
@@ -38,6 +38,6 @@
settings:
OPENQA_CONTAINERS: '1'
OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
+ RETRY: 2:https://progress.opensuse.org/issues/108665
description: >-
- Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
+ Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa" distri. Introduced retry on the job level due to https://progress.opensuse.org/issues/108665 as there can still be sporadic network issues sometimes.
- distri.
The last 80 jobs have been stable regarding zypper download.
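As a complement to the job-level retry, one possible reading of the observation above is that a single timeout shared by all three attempts lets one hanging attempt use up the whole budget. The following Python sketch gives each attempt its own time slice instead. It is only an illustration: the function name `run_with_attempt_timeouts` is hypothetical, the equal split of the 1h budget is an assumption, and the actual test module is written in Perl.

```python
import subprocess
from typing import Sequence


def run_with_attempt_timeouts(argv: Sequence[str], attempts: int = 3,
                              total_timeout: float = 3600.0) -> bool:
    """Return True as soon as one attempt succeeds within its time slice."""
    # Assumption: the overall budget is split evenly across the attempts.
    per_attempt = total_timeout / attempts
    for _ in range(attempts):
        try:
            result = subprocess.run(argv, timeout=per_attempt)
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before raising, so the
            # next attempt starts from a clean state.
            continue
        if result.returncode == 0:
            return True
    return False
```

With the defaults, `run_with_attempt_timeouts(["docker", "build", "openQA/container/worker", "-t", "openqa_worker"])` would allow each of the three builds at most 20 minutes, so a stalled download in the first attempt still leaves time for the remaining two.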