action #102221: t/25-cache-service.t fails exceeding 90s timeout consistently size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #102221

closed

t/25-cache-service.t fails exceeding 90s timeout consistently size:M

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

kraih

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-11-10

Due date:

2021-11-27

% Done:

Estimated time:

Description

Observation¶

Running t/25-cache-service.t locally for me consistently fails with

    ok 10 - Asset sle-12-SP3-x86_64-0368-200_88888@64bit.qcow2 downloaded correctly
    ok 11 - Asset sle-12-SP3-x86_64-0368-200_88888@64bit.qcow2 downloaded correctly
    ok 12 - Asset sle-12-SP3-x86_64-0368-200_88888@64bit.qcow2 downloaded correctly
<5>[26308] [i] Worker 26308 stopped
    # Cache minion worker stopped
<5>[26309] [i] Worker 26309 stopped
    # Cache minion worker stopped
Bailout called.  Further testing stopped:  test 't/25-cache-service.t' exceeds runtime limit of '90' seconds
Bail out!  test 't/25-cache-service.t' exceeds runtime limit of '90' seconds
FAILED--Further testing stopped: test 't/25-cache-service.t' exceeds runtime limit of '90' seconds

I ran count_fail_ratio prove -l -v t/25-cache-service.t using https://github.com/okurz/scripts/blob/master/count_fail_ratio with default 20 runs and all 20 runs fail consistently in the above step.

Problem¶

circleCI already runs Leap 15.3 as well https://github.com/os-autoinst/openQA/blob/master/container/devel:openQA:ci/base/Dockerfile#L7 and that works just fine there. If I bump the timeout 90s->300s then tests pass fine.

Acceptance criteria¶

AC1: Time limit of 90s is not exceeded

Suggestions¶

Run tests in a clean container e.g. toolbox -u -i registry.opensuse.org/opensuse/leap:15.3 -t leap15.3 or similar
Maybe files left over from another test run?
Consider a git status check like we have for os-autoinst
Investigate what's slowing the test down e.g. bisect test runtime from when the timeout was introduced or changed

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by livdywan over 3 years ago

Subject changed from t/25-cache-service.t fails (for me, on Leap 15.3) running into timeout consistently (longer timeouts work) to t/25-cache-service.t fails exceeding 90s timeout consistently size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by kraih over 3 years ago

Assignee set to kraih

Actions

Copy link

Updated by kraih over 3 years ago

Status changed from Workable to In Progress

Actions

Copy link

Updated by openqa_review over 3 years ago

Due date set to 2021-11-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by kraih over 3 years ago

I went back to January 2021 in 2 week intervals and i've been unable to measure any big jumps in test runtime. For me it is very consistently around 103 seconds on a very fast machine (with 2-3 seconds variance here and there).

All tests successful.
Files=1, Tests=23, 103 wallclock secs ( 0.07 usr  0.01 sys + 15.86 cusr  1.72 csys = 17.66 CPU)
Result: PASS

Actions

Copy link

Updated by kraih over 3 years ago

So i'm proposing to just bump the time limit to 110 seconds. https://github.com/os-autoinst/openQA/pull/4359

Actions

Copy link

Updated by kraih over 3 years ago

Honestly, i suspect that everyone on the team just runs tests with OPENQA_TEST_TIMEOUT_DISABLE locally now, and that's why the values don't match reality anymore. :)

Actions

Copy link

Updated by okurz over 3 years ago

kraih wrote:

Honestly, i suspect that everyone on the team just runs tests with OPENQA_TEST_TIMEOUT_DISABLE locally now, and that's why the values don't match reality anymore. :)

well, I don't so this is how I saw the issue ;) Actually I assume that most just don't run all the tests but rely on the CI.

Good that you tried to bisect. The question could be if there actually is any older state that works because if not that makes it more likely that other dependencies have introduced the change in runtime.

Actions

Copy link

Updated by okurz over 3 years ago

I realized now the updated AC which IMHO does not make much sense. It should rather say something like "t/25-cache-service.t runs consistently faster than 90s locally"

Actions

Copy link

#10

Updated by kraih over 3 years ago

okurz wrote:

Actually I assume that most just don't run all the tests but rely on the CI.

True. And i think only Marius and me have worked on the cache service recently. Both of us use openqa-test for local development, which sets OPENQA_TEST_TIMEOUT_DISABLE=1.

Actions

Copy link

#11

Updated by kraih over 3 years ago

Looking through the test code, i think i might have an idea for speeding up the tests. It currently spends most of its time waiting for workers to stop.

Actions

Copy link

#12

Updated by kraih over 3 years ago

Opened another PR with a significant performance improvement. https://github.com/os-autoinst/openQA/pull/4362

Actions

Copy link

#13

Updated by kraih over 3 years ago

Status changed from In Progress to Resolved

Actions

Copy link

#14

Updated by okurz over 2 years ago

Related to action #112898: Minion workers alert triggering on and off size:M added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #102221

t/25-cache-service.t fails exceeding 90s timeout consistently size:M

Observation¶

Problem¶

Acceptance criteria¶

Suggestions¶

Updated by livdywan over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by openqa_review over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by kraih over 3 years ago

Updated by okurz over 2 years ago