action #95851

closed

[sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M

Added by tinita over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2021-07-22
Due date: 2021-08-07
% Done: 0%
Estimated time:

Description

Problem

t/43-scheduling-and-worker-scalability.t fails more often than it used to (it is already marked as "unstable")

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7142/workflows/7a75593c-a719-4da8-934e-4c4433f67dc2/jobs/67405/steps

[10:02:57] t/43-scheduling-and-worker-scalability.t .. 1/? 
    #   Failed test 'all jobs done'
    #   at t/43-scheduling-and-worker-scalability.t line 218.
    #          got: '4'
    #     expected: '5'

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 219.
    #          got: '4'
    #     expected: '5'
    # All jobs:
    #  - id: 1, state: done, result: passed, reason: none
    #  - id: 2, state: done, result: passed, reason: none
    #  - id: 3, state: done, result: passed, reason: none
    #  - id: 4, state: done, result: passed, reason: none
    #  - id: 5, state: running, result: none, reason: none
    # Looks like you failed 2 tests of 3.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 2/? 
#   Failed test 'assign and run jobs'
#   at t/43-scheduling-and-worker-scalability.t line 221.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 3/? # Looks like you failed 1 test of 4.
                                                            [10:02:57] t/43-scheduling-and-worker-scalability.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests 

Acceptance Criteria

AC 1: t/43-scheduling-and-worker-scalability.t is stable for the next 10 runs after applying the solution

Suggestions

  • Just an observation: It became more unstable when many of our tests on CircleCI became slower and unstable
  • Can only be reproduced on CircleCI
  • Could be a timing problem
  • Consider that the slowdown from collecting code coverage could be the problem, especially regarding forking

Related issues: 2 (0 open, 2 closed)

Related to openQA Project - action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M (Resolved, okurz, 2021-07-22)

Related to openQA Project - action #96545: t/43-scheduling-and-worker-scalability.t fails in multiple OBS checks size:S (Resolved, okurz, 2021-08-04)

Actions #1

Updated by tinita over 2 years ago

  • Copied from action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Actions #2

Updated by ilausuch over 2 years ago

  • Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky to CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz over 2 years ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by okurz over 2 years ago

  • Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M to [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
Actions #6

Updated by tinita over 2 years ago

  • Description updated (diff)
Actions #7

Updated by tinita over 2 years ago

  • Copied from deleted (action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M)
Actions #8

Updated by tinita over 2 years ago

  • Related to action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Actions #9

Updated by mkittler over 2 years ago

  • Assignee set to mkittler

It looks like this test is indeed slower with coverage analysis enabled. However, that seems to be mainly at the end, when it generates the report, and not so much the test execution itself. It might nevertheless make sense to use OpenQA::Test::TimeLimit::scale_timeout here.
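
A minimal sketch of what applying that scaling could look like, assuming scale_timeout() simply multiplies a base number of seconds by an environment-dependent factor (e.g. larger when coverage analysis is enabled); the base value and the per-file limit are made up for illustration:

use strict;
use warnings;
use OpenQA::Test::TimeLimit '120';    # per-file limit; the value here is only an example

# Assumption: scale_timeout() takes a base timeout in seconds and returns it
# scaled up for slow environments such as coverage runs on CircleCI.
my $timeout = OpenQA::Test::TimeLimit::scale_timeout(30);

# The test would then poll with this scaled timeout instead of a hard-coded one,
# e.g. "all jobs done within $timeout seconds".
print "polling for up to $timeout seconds\n";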

Actions #10

Updated by mkittler over 2 years ago

I've been using OpenQA::Benchmark::Stopwatch to compare only the time of the test itself (and not DB initialization and generation of the final coverage report), and even then there's quite a difference: without coverage it takes ~2.990 s, with coverage ~12.418 s. It would make the most sense to simply disable coverage tracking in the forked processes of this test because we really shouldn't need it. (Nothing should be covered only by this test.)
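
For context, a measurement like the one below could be taken roughly as follows; the new/start/lap/stop/summary interface is assumed to mirror Benchmark::Stopwatch from CPAN, whose summary() prints exactly this kind of NAME/TIME/CUMULATIVE/PERCENTAGE table:

use strict;
use warnings;
use OpenQA::Benchmark::Stopwatch;

my $watch = OpenQA::Benchmark::Stopwatch->new->start;
# ... subtest 'wait for workers to be idle' would run here ...
$watch->lap('wait for workers to be idle');
# ... subtest 'assign and run jobs' would run here ...
$watch->lap('assign and run jobs');
# ... subtest 'stop all workers' would run here ...
$watch->lap('stop all workers');
$watch->stop;
print $watch->summary;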

Actions #11

Updated by mkittler over 2 years ago

Disabling the coverage report (Devel::Cover::report()) in the forked web UI and websocket server processes (the only forked processes which currently report coverage) doesn't make much of a difference. After that, the time spent on the different subtests looks like this:

coverage generally enabled but no Devel::Cover::report():
NAME                        TIME        CUMULATIVE      PERCENTAGE
 wait for workers to be idle 3.066       3.066           25.964%
 assign and run jobs         6.031       9.097           51.072%
 stop all workers            2.712       11.809          22.964%

coverage completely disabled:
NAME                        TIME        CUMULATIVE      PERCENTAGE
 wait for workers to be idle 2.125       2.125           70.712%
 assign and run jobs         0.853       2.978           28.378%
 stop all workers            0.027       3.006           0.910%

The functionality tested here is Perl-code intensive, so maybe the coverage analysis slows things down even without reporting coverage data from forked processes.

So for now I would just apply OpenQA::Test::TimeLimit::scale_timeout here (PR: https://github.com/os-autoinst/openQA/pull/4070).
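
For reference, the "no coverage report from forked children" experiment could look roughly like the following. This is an illustration rather than the change that was actually tested: it uses POSIX::_exit(), which skips END blocks and therefore Devel::Cover's report, instead of touching Devel::Cover::report() directly, and run_child_service() is a hypothetical stand-in for the forked web UI / websocket server work:

use strict;
use warnings;
use POSIX ();

sub run_child_service { sleep 1 }    # hypothetical placeholder for the real forked work

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    run_child_service();
    # _exit() bypasses END blocks, so a child running under Devel::Cover
    # terminates without writing its own coverage report.
    POSIX::_exit(0);
}
waitpid $pid, 0;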

Actions #12

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress
Actions #13

Updated by mkittler over 2 years ago

Actions #14

Updated by openqa_review over 2 years ago

  • Due date set to 2021-08-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #16

Updated by mkittler over 2 years ago

The test is now definitely more stable than before but still failed on the 163rd run within CircleCI:

Rerun 163 of 200 …
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 1/? 

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 220.
    #          got: '4'
    #     expected: '5'
    # All jobs:
    #  - id: 1, state: done, result: passed, reason: none
    #  - id: 2, state: done, result: incomplete, reason: setup failure: isotovideo can not be started
    #  - id: 3, state: done, result: passed, reason: none
    #  - id: 4, state: done, result: passed, reason: none
    #  - id: 5, state: done, result: passed, reason: none
    # Looks like you failed 1 test of 3.
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 2/? 
#   Failed test 'assign and run jobs'
#   at t/43-scheduling-and-worker-scalability.t line 222.
# Looks like you failed 1 test of 4.

I haven't seen this problem before. Note that isotovideo is just a dummy shell script within this test. Maybe we're hitting some limit here (which we wouldn't hit if we weren't running the test for the 163rd time)?

I have already prepared a change to no longer consider the test unstable: https://github.com/Martchus/openQA/pull/new/unstable-43-scheduling-and-worker-scalability

Actions #17

Updated by mkittler over 2 years ago

After a re-run it fails even on an earlier attempt. This time it looks more like the timeout is nevertheless too low:

Rerun 46 of 200 …
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 1/? Mojo::Reactor::Poll: I/O watcher failed: Not an ARRAY reference at /home/squamata/project/script/../lib/OpenQA/Worker/Job.pm line 1129.

    #   Failed test 'all jobs done'
    #   at t/43-scheduling-and-worker-scalability.t line 219.
    #          got: '4'
    #     expected: '5'

    #   Failed test 'all jobs passed'
    #   at t/43-scheduling-and-worker-scalability.t line 220.
    #          got: '4'
    #     expected: '5'
    # All jobs:
    #  - id: 1, state: done, result: passed, reason: none
    #  - id: 2, state: done, result: passed, reason: none
    #  - id: 3, state: uploading, result: none, reason: none
    #  - id: 4, state: done, result: passed, reason: none
    #  - id: 5, state: done, result: passed, reason: none
    # Looks like you failed 2 tests of 3.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 2/? 
#   Failed test 'assign and run jobs'
#   at t/43-scheduling-and-worker-scalability.t line 222.

    #   Failed test 'all workers offline'
    #   at t/43-scheduling-and-worker-scalability.t line 236.
    # 0
    # Looks like you failed 1 test of 1.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 3/? 
#   Failed test 'stop all workers'
#   at t/43-scheduling-and-worker-scalability.t line 237.
# Looks like you failed 2 tests of 4.

Actions #18

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

I'm marking the issue as resolved because the test is now as stable as it was before. It is still considered unstable, but the retry we have for those tests should be sufficient.

Actions #19

Updated by livdywan over 2 years ago

  • Related to action #96545: t/43-scheduling-and-worker-scalability.t fails in multiple OBS checks size:S added