action #98898

closed

`t/05-scheduler-full.t` sometimes fails in CircleCI size:M

Added by mkittler about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-20
Due date: 2021-10-06
% Done: 0%
Estimated time:

Description

The test is already deemed unstable, but it now also fails repeatedly despite the retry we have for unstable tests. There currently seem to be two types of failures, which may or may not share the same cause.


Test Summary Report
-------------------
t/05-scheduler-full.t (Wstat: 256 Tests: 5 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=2, Tests=51, 598.414 wallclock secs ( 1.77 usr  0.10 sys + 501.36 cusr 22.31 csys = 525.54 CPU)
Result: FAIL
Retry 5 of 5 …
[16:38:38] t/05-scheduler-full.t .. 2/?
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 219.
    #          got: 'running'
    #     expected: 'done'

    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 221.
    #          got: 'none'
    #     expected: 'incomplete'

    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 223.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
    # Looks like you failed 3 tests of 12.
[16:38:38] t/05-scheduler-full.t .. 3/?
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 227.
[16:38:38] t/05-scheduler-full.t .. 4/? [16:38:38] t/05-scheduler-full.t .. 5/? # Looks like you failed 1 test of 5.
                                         [16:38:38] t/05-scheduler-full.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests

So far I have only seen this failure on https://github.com/os-autoinst/openQA/pull/4213 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7726/workflows/132858b7-d6bd-4df0-9667-50caca647581/jobs/73025), but it seems unrelated to that PR because the test passes locally and re-triggering eventually helped. I've attached the detailed log.


[09:08:13] t/05-scheduler-full.t .. 3/? 
    #   Failed test 'Allocated maximum number of jobs that could have been allocated'
    #   at t/05-scheduler-full.t line 240.
    #          got: '9'
    #     expected: '10'
    # Looks like you failed 1 test of 1.

This one has been seen on multiple PRs, e.g. https://github.com/os-autoinst/openQA/pull/4208 (https://app.circleci.com/pipelines/github/os-autoinst/openQA/7736/workflows/afd32447-77c5-4f29-b1fe-aecce3dbc434/jobs/73024).

Acceptance criteria

  • AC1: scheduler-full.t is reliable in CircleCI after repeated runs

Suggestion

  • Check whether there is a hard-coded wait (e.g. 5 minutes) after which the test proceeds before the jobs have finished
  • Try to reproduce locally
  • Rough idea: insert a sleep where the job is set to done, i.e. via mocking in the test, to make it slow
  • If it can't be reproduced locally, come up with an improvement based on the CircleCI logs and reading the code, then monitor CI to see whether it improves
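
A minimal sketch of the first and third suggestions, using only core Perl. Everything in it is hypothetical and for illustration only: `My::FakeJob`, `wait_until` and the delays are not part of openQA or of `t/05-scheduler-full.t`. The fake job turns `done` only after a delay, the wrapped constructor stands in for a sleep injected via mocking (in the real test a mocking module such as Test::MockModule could play that role), and the polling helper shows what could replace a fixed wait.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical stand-in for a job that a worker finishes after some delay.
# This is NOT the openQA job API.
package My::FakeJob {
    sub new {
        my ($class, $delay) = @_;
        return bless {done_at => Time::HiRes::time() + $delay}, $class;
    }
    # Report 'running' until the completion time has passed, then 'done'.
    sub state { Time::HiRes::time() >= $_[0]{done_at} ? 'done' : 'running' }
}

# Hypothetical helper: poll a condition until it holds or the timeout
# expires, instead of sleeping for a fixed time and asserting afterwards.
sub wait_until {
    my ($condition, $timeout, $interval) = @_;
    my $deadline = time + $timeout;
    while (1) {
        return 1 if $condition->();
        return 0 if time >= $deadline;
        sleep $interval;
    }
}

my ($job, $state_after_fixed_wait, $reached_done);
{
    # Simulate a slow CI machine: wrap the constructor so that every job
    # takes 0.5 s longer to finish (the "inserted sleep").
    no warnings 'redefine';
    my $orig = \&My::FakeJob::new;
    local *My::FakeJob::new = sub {
        my ($class, $delay) = @_;
        return $orig->($class, $delay + 0.5);
    };

    $job = My::FakeJob->new(0.05);

    # A fixed wait that is long enough on a fast machine is too short here.
    sleep 0.1;
    $state_after_fixed_wait = $job->state;

    # Polling with a generous timeout tolerates the slowdown.
    $reached_done = wait_until(sub { $job->state eq 'done' }, 5, 0.05);
}

print "state after fixed wait: $state_after_fixed_wait\n";
print "polling reached 'done': ", ($reached_done ? 'yes' : 'no'), "\n";
```

If slowing down the real code path in this way reproduces the `got: 'running'` failure locally, that would point to a timing assumption in the test rather than a scheduler bug.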

Files

05-scheduler-full.t (226 KB) mkittler, 2021-09-20 09:44