action #95851
Status: closed
[sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
Description
Problem
t/43-scheduling-and-worker-scalability.t fails more often than it used to (it is already on the "unstable" list)
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 1/?
# Failed test 'all jobs done'
# at t/43-scheduling-and-worker-scalability.t line 218.
# got: '4'
# expected: '5'
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 219.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: passed, reason: none
# - id: 3, state: done, result: passed, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: running, result: none, reason: none
# Looks like you failed 2 tests of 3.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 221.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 3/? # Looks like you failed 1 test of 4.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests
Acceptance Criteria
AC 1: t/43-scheduling-and-worker-scalability.t is stable for the next 10 runs after applying the solution
Suggestions
- Just an observation: it became more unstable around the time many of our tests on CircleCI became slower and unstable
- Can only be reproduced on CircleCI
- Could be a timing problem
- Consider that the slowdown from collecting code coverage may be the problem, especially regarding forking
Updated by tinita over 3 years ago
- Copied from action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Updated by ilausuch over 3 years ago
- Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky to CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 3 years ago
Just for reference: https://github.com/os-autoinst/openQA/pull/4037
Updated by okurz over 3 years ago
- Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M to [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
Updated by tinita over 3 years ago
- Copied from deleted (action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M)
Updated by tinita over 3 years ago
- Related to action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Updated by mkittler over 3 years ago
- Assignee set to mkittler
It looks like this test is indeed slower with coverage analysis enabled. However, that seems to be mainly the end, when the report is generated, and not so much the test execution itself. Maybe it nevertheless makes sense to use OpenQA::Test::TimeLimit::scale_timeout here.
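A minimal sketch of what such timeout scaling could look like; the slowdown factor and the COVERAGE environment variable are assumptions for illustration, not the actual scale_timeout implementation:

use strict;
use warnings;

# multiply a base timeout when the test runs under coverage analysis
sub scale_timeout_sketch {
    my ($timeout) = @_;
    my $factor = $ENV{COVERAGE} ? 4 : 1;    # assumed slowdown factor under Devel::Cover
    return $timeout * $factor;
}

# e.g. wait up to 40 s instead of 10 s for all jobs to be done when coverage is enabled
my $wait_limit = scale_timeout_sketch(10);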
Updated by mkittler over 3 years ago
I've been using OpenQA::Benchmark::Stopwatch to compare only the time of the test itself (excluding database initialization and generation of the final coverage report), and even then there's quite a difference: without coverage it takes ~2.990 s and with coverage ~12.418 s. It would make most sense to simply disable coverage tracking for this test in forked processes because we really shouldn't need it. (Nothing should be covered only by this test.)
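For reference, a sketch of how such per-phase timings can be collected, assuming OpenQA::Benchmark::Stopwatch offers the same interface as the CPAN module Benchmark::Stopwatch (new/start/lap/stop/summary):

use strict;
use warnings;
use OpenQA::Benchmark::Stopwatch;

my $stopwatch = OpenQA::Benchmark::Stopwatch->new->start;

# ... subtest 'wait for workers to be idle' ...
$stopwatch->lap('wait for workers to be idle');
# ... subtest 'assign and run jobs' ...
$stopwatch->lap('assign and run jobs');
# ... subtest 'stop all workers' ...
$stopwatch->lap('stop all workers');

$stopwatch->stop;
print $stopwatch->summary;    # prints a NAME/TIME/CUMULATIVE/PERCENTAGE table per lap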
Updated by mkittler over 3 years ago
Disabling the coverage report (Devel::Cover::report()) in the forked web UI and websocket server processes (the only forked processes which currently report coverage) doesn't make much of a difference; see the sketch at the end of this comment for what that pattern looks like. After that change the time spent on the different subtests looks like this:
coverage generally enabled but no Devel::Cover::report():
NAME TIME CUMULATIVE PERCENTAGE
wait for workers to be idle 3.066 3.066 25.964%
assign and run jobs 6.031 9.097 51.072%
stop all workers 2.712 11.809 22.964%
coverage completely disabled:
NAME TIME CUMULATIVE PERCENTAGE
wait for workers to be idle 2.125 2.125 70.712%
assign and run jobs 0.853 2.978 28.378%
stop all workers 0.027 3.006 0.910%
The functionality tested here exercises a lot of Perl code, so maybe the coverage analysis slows things down even without reporting coverage data from forked processes.
So for now I would just apply OpenQA::Test::TimeLimit::scale_timeout here (PR: https://github.com/os-autoinst/openQA/pull/4070).
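For context, a minimal sketch of the pattern referred to above: a forked child explicitly flushes its coverage data before exiting via _exit, and leaving that call out is what "disabling the coverage report" means here. The SKIP_FORKED_COVERAGE guard is an assumption for illustration, not the actual openQA code.

use strict;
use warnings;
use POSIX '_exit';

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    # ... the child would run the web UI or websocket server needed by the test ...
    # write coverage data explicitly because _exit skips Devel::Cover's END handling;
    # skipping this call (via the hypothetical guard below) is what was tried here
    Devel::Cover::report() if Devel::Cover->can('report') && !$ENV{SKIP_FORKED_COVERAGE};
    _exit(0);
}
waitpid $pid, 0;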
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
Updated by mkittler over 3 years ago
Created another PR: https://github.com/os-autoinst/openQA/pull/4072
See also https://github.com/os-autoinst/openQA/pull/4070#issuecomment-885688115 and previous comments for an explanation.
Updated by openqa_review over 3 years ago
- Due date set to 2021-08-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
PR to test stability: https://github.com/os-autoinst/openQA/pull/4080
Updated by mkittler over 3 years ago
The test is now definitely more stable than before but still failed on the 163rd run within CircleCI:
Rerun 163 of 200 …
[12:32:25] t/43-scheduling-and-worker-scalability.t ..
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 1/?
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 220.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: incomplete, reason: setup failure: isotovideo can not be started
# - id: 3, state: done, result: passed, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: done, result: passed, reason: none
# Looks like you failed 1 test of 3.
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 222.
# Looks like you failed 1 test of 4.
I haven't seen this problem before. Note that isotovideo is just a dummy shell script within this test (see the sketch at the end of this comment). Maybe we're hitting some limit here (which we wouldn't hit if not running the test for the 163rd time)?
I have already prepared a change for considering the test no longer unstable: https://github.com/Martchus/openQA/pull/new/unstable-43-scheduling-and-worker-scalability
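To illustrate what "dummy shell script" means, a hypothetical sketch of how a test could install a stand-in isotovideo; the paths and script content are assumptions, not taken from the actual test:

use strict;
use warnings;
use File::Path 'make_path';

my $dir = 't/data/fake-isotovideo';
make_path($dir);
my $script = "$dir/isotovideo";
open my $fh, '>', $script or die "unable to write $script: $!";
print $fh "#!/bin/sh\n# pretend to run a job for a moment, then succeed\nsleep 1\n";
close $fh;
chmod 0755, $script;
# the workers under test are then pointed at this script instead of the real isotovideo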
Updated by mkittler over 3 years ago
After a re-run it failed even at an earlier attempt. This time it looks more like the timeout is simply still too low:
Rerun 46 of 200 …
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 1/? Mojo::Reactor::Poll: I/O watcher failed: Not an ARRAY reference at /home/squamata/project/script/../lib/OpenQA/Worker/Job.pm line 1129.
# Failed test 'all jobs done'
# at t/43-scheduling-and-worker-scalability.t line 219.
# got: '4'
# expected: '5'
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 220.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: passed, reason: none
# - id: 3, state: uploading, result: none, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: done, result: passed, reason: none
# Looks like you failed 2 tests of 3.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 222.
# Failed test 'all workers offline'
# at t/43-scheduling-and-worker-scalability.t line 236.
# 0
# Looks like you failed 1 test of 1.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 3/?
# Failed test 'stop all workers'
# at t/43-scheduling-and-worker-scalability.t line 237.
# Looks like you failed 2 tests of 4.
Updated by mkittler over 3 years ago
- Status changed from In Progress to Resolved
I'm marking the issue as resolved because the test is now at least as stable as before. It is still considered unstable, but the retry we have in place for such tests should be sufficient.
Updated by livdywan about 3 years ago
- Related to action #96545: t/43-scheduling-and-worker-scalability.t fails in multiple OBS checks size:S added