action #108665
closed
openqa_from_containers repeatedly failing at build
Description
openqa_from_containers has repeatedly failed at build; https://openqa.opensuse.org/tests/2256377 is presumably the first occurrence:
# Test died: command 'for i in {1..3}; do docker build openQA/container/worker -t openqa_worker && break; done' timed out at openqa//tests/containers/build.pm line 9.
Updated by okurz over 2 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz over 2 years ago
A quick glance at https://openqa.opensuse.org/tests/2256377#next_previous shows 7/100 test failures in "build", all on 2022-03-21, and no failures in "build" since then. The specific errors differ in each case, but all of them occur within zypper calls. Within os-autoinst-distri-opensuse zypper is already mostly called with retrying, so we should likely do the equivalent here. Sadly there is no movement on https://github.com/openSUSE/zypper/issues/420 to do that within zypper itself. Created https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/81 for a trivial timeout bump. This accounts for 3/10 job failures. The other 7 all fail in some zypper-internal step, so we need an actual retry, potentially parsing the error message so that we retry only on those errors and not on all failures in general.
Updated by okurz over 2 years ago
I am thinking of a more generic retry wrapper. For this I am trying to continue with https://github.com/okurz/scripts/blob/master/retry, which I slightly optimized in https://github.com/okurz/scripts/commit/aacb70e8281e2783877e18286b2d4cafe5f60ec5. Before extending it I added tests with https://github.com/okurz/scripts/commit/2a4d102305eec3ec93b831a88d421b35fe14c276, https://github.com/okurz/scripts/commit/8048a66b662dcf15c977fc63d5d2f54e3b45c34b and https://github.com/okurz/scripts/commit/0509f3f1e4084e68d7f5a5d3460b532ba83be0d3. The idea is to extend the retry script with a passlist and a blocklist of regexes matched against the command output: if a passlist is specified, retry only if an entry matches the output; if a blocklist is specified, retry unless an entry matches the output. https://unix.stackexchange.com/a/499091 looks promising.
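To make the passlist/blocklist idea concrete, here is a minimal POSIX sh sketch. It is not the interface of the retry script linked above. The function name `retry_matching` and the variables `RETRY_PASSLIST` and `RETRY_BLOCKLIST` are assumptions made up for this illustration. Output is captured in full before it is printed, so long-running commands show nothing until each attempt ends.

```sh
#!/bin/sh
# Hypothetical helper, NOT the actual okurz/scripts "retry" interface.
# Usage: retry_matching <attempts> <sleep_seconds> <command> [args...]
# Optional variables (names are assumptions for this sketch):
#   RETRY_PASSLIST   extended regex; retry only if the output matches it
#   RETRY_BLOCKLIST  extended regex; never retry if the output matches it
retry_matching() {
    rm_attempts=$1
    rm_delay=$2
    shift 2
    rm_n=1
    while :; do
        # Capture stdout and stderr together so both can be matched.
        if rm_output=$("$@" 2>&1); then rm_rc=0; else rm_rc=$?; fi
        if [ -n "$rm_output" ]; then printf '%s\n' "$rm_output"; fi
        if [ "$rm_rc" -eq 0 ]; then return 0; fi
        # Give up once the attempt budget is used.
        if [ "$rm_n" -ge "$rm_attempts" ]; then return "$rm_rc"; fi
        # Passlist set: a failure whose output does not match is final.
        if [ -n "${RETRY_PASSLIST:-}" ] &&
            ! printf '%s\n' "$rm_output" | grep -Eq -e "$RETRY_PASSLIST"; then
            return "$rm_rc"
        fi
        # Blocklist set: a failure whose output matches is final.
        if [ -n "${RETRY_BLOCKLIST:-}" ] &&
            printf '%s\n' "$rm_output" | grep -Eq -e "$RETRY_BLOCKLIST"; then
            return "$rm_rc"
        fi
        rm_n=$((rm_n + 1))
        sleep "$rm_delay"
    done
}
```

With this sketch, the build step could be wrapped as `RETRY_PASSLIST='timed out'` followed by `retry_matching 3 10 docker build openQA/container/worker -t openqa_worker`. The regex there is only a placeholder, not a vetted list of zypper error messages. Matching on output keeps the wrapper generic, at the price of depending on message wording that may change between zypper versions.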
Updated by openqa_review over 2 years ago
- Due date set to 2022-04-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Due date deleted (2022-04-07)
- Status changed from In Progress to Resolved
Looking at the video, I observed that already the first of up to three attempts was aborted internally, with an overall timeout of 1h covering all attempts.
I added retry on the job level:
@@ -38,6 +38,6 @@
settings:
OPENQA_CONTAINERS: '1'
OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
+ RETRY: 2:https://progress.opensuse.org/issues/108665
description: >-
- Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
- distri.
+ Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa" distri. Introduced retry on the job level due to https://progress.opensuse.org/issues/108665 as there can still be sporadic network issues sometimes.
The last 80 jobs have been stable regarding zypper download.