action #95851
Status: closed
[sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
Description
Problem
t/43-scheduling-and-worker-scalability.t fails more often than it used to (it is already on the "unstable" list)
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 1/?
# Failed test 'all jobs done'
# at t/43-scheduling-and-worker-scalability.t line 218.
# got: '4'
# expected: '5'
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 219.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: passed, reason: none
# - id: 3, state: done, result: passed, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: running, result: none, reason: none
# Looks like you failed 2 tests of 3.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 221.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. 3/? # Looks like you failed 1 test of 4.
[10:02:57] t/43-scheduling-and-worker-scalability.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests
Acceptance Criteria
AC 1: t/43-scheduling-and-worker-scalability.t is stable for the next 10 runs after applying the solution
Suggestions
- Just an observation: it became more unstable around the time many of our tests on CircleCI became slower and unstable
- Can only be reproduced on CircleCI
- Could be a timing problem
- Consider that the slowdown from collecting code coverage may be the problem, especially regarding forking
Updated by tinita over 3 years ago
- Copied from action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Updated by ilausuch over 3 years ago
- Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky to CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 3 years ago
Just for reference: https://github.com/os-autoinst/openQA/pull/4037
Updated by okurz over 3 years ago
- Subject changed from CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M to [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M
Updated by tinita over 3 years ago
- Copied from deleted (action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M)
Updated by tinita over 3 years ago
- Related to action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M added
Updated by mkittler over 3 years ago
- Assignee set to mkittler
It looks like this test is indeed slower with coverage analysis enabled. However, that seems to be mainly the end, when the report is generated, and not so much the test execution itself. Maybe it nevertheless makes sense to use OpenQA::Test::TimeLimit::scale_timeout here.
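A minimal sketch of what such timeout scaling could look like; the slowdown factor and the COVERAGE environment variable are assumptions for illustration, not the actual scale_timeout implementation:

use strict;
use warnings;

# multiply a base timeout when the test runs under coverage analysis
sub scale_timeout_sketch {
    my ($timeout) = @_;
    my $factor = $ENV{COVERAGE} ? 4 : 1;    # assumed slowdown factor under Devel::Cover
    return $timeout * $factor;
}

# e.g. wait up to 40 s instead of 10 s for all jobs to be done when coverage is enabled
my $wait_limit = scale_timeout_sketch(10);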
Updated by mkittler over 3 years ago
I've been using OpenQA::Benchmark::Stopwatch to compare only the time of the test itself (excluding database initialization and generation of the final coverage report), and even then there's quite a difference: without coverage it takes ~2.990 s and with coverage ~12.418 s. It would make most sense to simply disable coverage tracking for this test in forked processes because we really shouldn't need it. (Nothing should be covered only by this test.)
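For reference, a sketch of how such per-phase timings can be collected, assuming OpenQA::Benchmark::Stopwatch offers the same interface as the CPAN module Benchmark::Stopwatch (new/start/lap/stop/summary):

use strict;
use warnings;
use OpenQA::Benchmark::Stopwatch;

my $stopwatch = OpenQA::Benchmark::Stopwatch->new->start;

# ... subtest 'wait for workers to be idle' ...
$stopwatch->lap('wait for workers to be idle');
# ... subtest 'assign and run jobs' ...
$stopwatch->lap('assign and run jobs');
# ... subtest 'stop all workers' ...
$stopwatch->lap('stop all workers');

$stopwatch->stop;
print $stopwatch->summary;    # prints a NAME/TIME/CUMULATIVE/PERCENTAGE table per lap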
Updated by mkittler over 3 years ago
Disabling the coverage report (Devel::Cover::report()) in the forked web UI and websocket server processes (the only forked processes which currently report coverage) doesn't make much of a difference; see the sketch at the end of this comment for what that pattern looks like. After that change the time spent on the different subtests looks like this:
coverage generally enabled but no Devel::Cover::report():
NAME TIME CUMULATIVE PERCENTAGE
wait for workers to be idle 3.066 3.066 25.964%
assign and run jobs 6.031 9.097 51.072%
stop all workers 2.712 11.809 22.964%
coverage completely disabled:
NAME TIME CUMULATIVE PERCENTAGE
wait for workers to be idle 2.125 2.125 70.712%
assign and run jobs 0.853 2.978 28.378%
stop all workers 0.027 3.006 0.910%
The functionality tested here exercises a lot of Perl code, so maybe the coverage analysis slows things down even without reporting coverage data from forked processes.
So for now I would just apply OpenQA::Test::TimeLimit::scale_timeout here (PR: https://github.com/os-autoinst/openQA/pull/4070).
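For context, a minimal sketch of the pattern referred to above: a forked child explicitly flushes its coverage data before exiting via _exit, and leaving that call out is what "disabling the coverage report" means here. The SKIP_FORKED_COVERAGE guard is an assumption for illustration, not the actual openQA code.

use strict;
use warnings;
use POSIX '_exit';

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    # ... the child would run the web UI or websocket server needed by the test ...
    # write coverage data explicitly because _exit skips Devel::Cover's END handling;
    # skipping this call (via the hypothetical guard below) is what was tried here
    Devel::Cover::report() if Devel::Cover->can('report') && !$ENV{SKIP_FORKED_COVERAGE};
    _exit(0);
}
waitpid $pid, 0;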
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
Updated by mkittler over 3 years ago
Created another PR: https://github.com/os-autoinst/openQA/pull/4072
See also https://github.com/os-autoinst/openQA/pull/4070#issuecomment-885688115 and previous comments for an explanation.
Updated by openqa_review over 3 years ago
- Due date set to 2021-08-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
PR to test stability: https://github.com/os-autoinst/openQA/pull/4080
Updated by mkittler over 3 years ago
The test is now definitely more stable than before but still failed on the 163rd run within CircleCI:
Rerun 163 of 200 …
[12:32:25] t/43-scheduling-and-worker-scalability.t ..
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 1/?
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 220.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: incomplete, reason: setup failure: isotovideo can not be started
# - id: 3, state: done, result: passed, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: done, result: passed, reason: none
# Looks like you failed 1 test of 3.
[12:32:25] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 222.
# Looks like you failed 1 test of 4.
I haven't seen this problem before. Note that isotovideo is just a dummy shell script within this test (see the sketch at the end of this comment). Maybe we're hitting some limit here (which we wouldn't hit if not running the test for the 163rd time)?
I have already prepared a change for considering the test no longer unstable: https://github.com/Martchus/openQA/pull/new/unstable-43-scheduling-and-worker-scalability
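To illustrate what "dummy shell script" means, a hypothetical sketch of how a test could install a stand-in isotovideo; the paths and script content are assumptions, not taken from the actual test:

use strict;
use warnings;
use File::Path 'make_path';

my $dir = 't/data/fake-isotovideo';
make_path($dir);
my $script = "$dir/isotovideo";
open my $fh, '>', $script or die "unable to write $script: $!";
print $fh "#!/bin/sh\n# pretend to run a job for a moment, then succeed\nsleep 1\n";
close $fh;
chmod 0755, $script;
# the workers under test are then pointed at this script instead of the real isotovideo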
Updated by mkittler over 3 years ago
After a re-run it failed even at an earlier attempt. This time it looks more like the timeout is simply still too low:
Rerun 46 of 200 …
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 1/? Mojo::Reactor::Poll: I/O watcher failed: Not an ARRAY reference at /home/squamata/project/script/../lib/OpenQA/Worker/Job.pm line 1129.
# Failed test 'all jobs done'
# at t/43-scheduling-and-worker-scalability.t line 219.
# got: '4'
# expected: '5'
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 220.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: passed, reason: none
# - id: 3, state: uploading, result: none, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: done, result: passed, reason: none
# Looks like you failed 2 tests of 3.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 2/?
# Failed test 'assign and run jobs'
# at t/43-scheduling-and-worker-scalability.t line 222.
# Failed test 'all workers offline'
# at t/43-scheduling-and-worker-scalability.t line 236.
# 0
# Looks like you failed 1 test of 1.
[14:37:36] t/43-scheduling-and-worker-scalability.t .. 3/?
# Failed test 'stop all workers'
# at t/43-scheduling-and-worker-scalability.t line 237.
# Looks like you failed 2 tests of 4.
Updated by mkittler over 3 years ago
- Status changed from In Progress to Resolved
I'm marking the issue as resolved because the test is now at least as stable as before. It is still considered unstable, but the retry we have in place for such tests should be sufficient.
Updated by livdywan about 3 years ago
- Related to action #96545: t/43-scheduling-and-worker-scalability.t fails in multiple OBS checks size:S added