action #108665

closed

openqa_from_containers repeatedly failing at build

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-03-21
Due date:
% Done: 0%
Estimated time:

Description

openqa_from_containers has repeatedly failed at build. https://openqa.opensuse.org/tests/2256377 is presumably the first occurrence:

# Test died: command 'for i in {1..3}; do docker build openQA/container/worker -t openqa_worker && break; done' timed out at openqa//tests/containers/build.pm line 9.

See https://openqa.opensuse.org/tests/2257149#next_previous

Actions #1

Updated by okurz about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by okurz about 2 years ago

A quick look at https://openqa.opensuse.org/tests/2256377#next_previous shows 7/100 test failures in "build", all on 2022-03-21, with no failures in "build" since then. Each case showed a different specific error, but all of them occurred within zypper calls. Within os-autoinst-distri-opensuse, zypper is already mostly called with retries, so we should likely do the equivalent here. Sadly, there is no movement on https://github.com/openSUSE/zypper/issues/420 to do that within zypper itself. I created https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/81 for a trivial timeout bump. This accounts for 3/10 job failures. The other 7 all fail in some zypper-internal step, so we need an actual retry, potentially parsing the error message so that we retry only on those errors rather than on all failures in general.

Actions #3

Updated by okurz about 2 years ago

I am thinking of a more generic retry wrapper. For this I am continuing with https://github.com/okurz/scripts/blob/master/retry, which I slightly optimized in https://github.com/okurz/scripts/commit/aacb70e8281e2783877e18286b2d4cafe5f60ec5. Before extending it, I added tests with https://github.com/okurz/scripts/commit/2a4d102305eec3ec93b831a88d421b35fe14c276, https://github.com/okurz/scripts/commit/8048a66b662dcf15c977fc63d5d2f54e3b45c34b and https://github.com/okurz/scripts/commit/0509f3f1e4084e68d7f5a5d3460b532ba83be0d3. The planned extension is a passlist and a blocklist of regex matches on the command output: if a passlist is specified, the script would retry only when a passlist entry matches the command output; if a blocklist is specified, it would retry unless a blocklist entry matches the command output. https://unix.stackexchange.com/a/499091 looks promising.
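To make the passlist/blocklist idea concrete, here is a minimal shell sketch. It is not the interface of the retry script linked above: the function name `retry_filtered`, its argument order, and the `RETRY_SLEEP` variable are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual okurz/scripts "retry" interface.
# Usage: retry_filtered RETRIES PASS_REGEX BLOCK_REGEX COMMAND [ARGS...]
#   RETRIES      number of retries after the first attempt
#   PASS_REGEX   if non-empty, retry only when the output matches it
#   BLOCK_REGEX  if non-empty, do not retry when the output matches it
retry_filtered() {
    local retries="$1" pass_re="$2" block_re="$3"
    shift 3
    local attempt=0 output status
    while true; do
        # Capture stdout and stderr together so both can be matched.
        output=$("$@" 2>&1) && status=0 || status=$?
        # Pass the captured output through to the caller.
        printf '%s\n' "$output"
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Give up once all retries are used.
        if [ "$attempt" -gt "$retries" ]; then
            return "$status"
        fi
        # Passlist: give up if the error is not a known transient one.
        if [ -n "$pass_re" ] && ! printf '%s\n' "$output" | grep -Eq -- "$pass_re"; then
            return "$status"
        fi
        # Blocklist: give up if the error is known to be permanent.
        if [ -n "$block_re" ] && printf '%s\n' "$output" | grep -Eq -- "$block_re"; then
            return "$status"
        fi
        # Pause between attempts (assumed default: 1 second).
        sleep "${RETRY_SLEEP:-1}"
    done
}
```

A call could look like `retry_filtered 3 'PATTERN_FOR_TRANSIENT_ERRORS' '' zypper ...`, where the pattern is a placeholder to be filled with whatever error messages actually turn out to be transient. Passing an empty string for both patterns makes it behave like a plain retry loop.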

Actions #4

Updated by openqa_review about 2 years ago

  • Due date set to 2022-04-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz about 2 years ago

  • Due date deleted (2022-04-07)
  • Status changed from In Progress to Resolved

Looking at the video, I observed that the very first of up to three attempts was already aborted internally, with an overall timeout of 1h.

I added retry on the job level:

@@ -38,6 +38,6 @@
         settings:
           OPENQA_CONTAINERS: '1'
           OPENQA_FROM_GIT: '1' # see: main.pm, avoid load_osautoinst_tests
+          RETRY: 2:https://progress.opensuse.org/issues/108665
         description: >-
-          Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa"
+          Maintainer: okurz@suse.de Test for running openQA itself from containers. To be used with "openqa" distri. Introduced retry on the job level due to https://progress.opensuse.org/issues/108665 as there can still be sporadic network issues sometimes.
-          distri.

The last 80 jobs have been stable regarding zypper download.
