Project

General

Profile

Actions

action #95848

closed

[sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M

Added by tinita over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-22
Due date:
% Done:

0%

Estimated time:

Description

Problem

t/05-scheduler-full.t fails more often than it used too (it is in "unstable" already)

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7142/workflows/7a75593c-a719-4da8-934e-4c4433f67dc2/jobs/67405/steps

[09:54:43] t/05-scheduler-full.t ..................... 2/? 
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 220.
    #          got: 'running'
    #     expected: 'done'

    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 222.
    #          got: 'none'
    #     expected: 'incomplete'

    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 224.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
    # Looks like you failed 3 tests of 12.
[09:54:43] t/05-scheduler-full.t ..................... 3/? 
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 228.
[09:54:43] t/05-scheduler-full.t ..................... 4/? [09:54:43] t/05-scheduler-full.t ..................... 5/? # Looks like you failed 1 test of 5.
                                                            [09:54:43] t/05-scheduler-full.t ..................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests 

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7147/workflows/fe33092f-ebf0-4224-a1a9-234babeff802/jobs/67468/steps

[14:04:04] t/05-scheduler-full.t ..................... 1/? 
    #   Failed test 'no more jobs need to be allocated'
    #   at t/05-scheduler-full.t line 133.
    #          got: '1'
    #     expected: '0'
    # Looks like you failed 1 test of 4.
[14:04:04] t/05-scheduler-full.t ..................... 2/? 
#   Failed test 'Scheduler worker job allocation'
#   at t/05-scheduler-full.t line 137.

Acceptance Criteria

AC 1: t/05-scheduler-full.t is stable for 10 runs after apply the solution

Suggestions

  • Just an observation: It became more unstable when many of our tests on CircleCI became slower and unstable

Related issues 1 (0 open1 closed)

Related to openQA Project (public) - action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:MResolvedmkittler2021-07-222021-08-07

Actions
Actions #1

Updated by tinita over 3 years ago

  • Copied from action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M added
Actions #2

Updated by tinita over 3 years ago

  • Copied to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added
Actions #3

Updated by ilausuch over 3 years ago

  • Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky to CircleCI: openQA t/05-scheduler-full.t flaky size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
Actions #4

Updated by okurz over 3 years ago

  • Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky size:M to [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M
Actions #5

Updated by tinita over 3 years ago

  • Copied to deleted (action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M)
Actions #6

Updated by tinita over 3 years ago

  • Related to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added
Actions #7

Updated by tinita over 3 years ago

  • Copied from deleted (action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M)
Actions #8

Updated by tinita over 3 years ago

  • Description updated (diff)
Actions #9

Updated by ilausuch over 3 years ago

Checking on the error lines I found that

For the line 220 we have an active waiting

    for (0 .. 100) {
        last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
        sleep .2;
    }

I created this PR https://github.com/os-autoinst/openQA/pull/4074 that increase the timeout 10 seconds for this case.
This waits 20 seconds. Maybe on the cases that there is more load that usual then we don't wait enough time.

Actions #10

Updated by ilausuch over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to ilausuch
Actions #11

Updated by ilausuch over 3 years ago

The problem here is how to ensure that the tests we run can prove that this works when the load is high

Actions #12

Updated by ilausuch over 3 years ago

About the second problem on line 133. I am not sure how it could happens tanking in consideration the previous checks. Unless the ->schedule() method returns too early a result and we have to wait.

Actions #13

Updated by tinita over 3 years ago

to me the failure in line 133 looks like the same problem, it calls wait_for_worker a few lines above.

Actions #14

Updated by openqa_review over 3 years ago

  • Due date set to 2021-08-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by ilausuch over 3 years ago

  • Due date deleted (2021-08-10)
  • Status changed from In Progress to Workable

I leave this ticket. No more ideas

Actions #16

Updated by ilausuch over 3 years ago

  • Assignee deleted (ilausuch)
Actions #17

Updated by okurz over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #19

Updated by okurz over 3 years ago

tests failed in 1/4 runs

…Rerun 4 of 100 …
[12:45:46] t/05-scheduler-full.t ..................... 3/? 
    #   Failed test 'sub process openqa-webapi terminated with exit code 9'
    #   at /home/squamata/project/t/lib/OpenQA/Test/Utils.pm line 223.
A context appears to have been destroyed without first calling release().
Based on $@ it does not look like an exception was thrown (this is not always
a reliable test)

This is a problem because the global error variables ($!, $@, and $?) will
not be restored. In addition some release callbacks will not work properly from
inside a DESTROY method.

Here are the context creation details, just in case a tool forgot to call
release():
  File: t/05-scheduler-full.t
  Line: 277
  Tool: Test::More::subtest

Cleaning up the CONTEXT stack...
# Test ended with extra hubs on the stack!
    # Tests were run but no plan was declared and done_testing() was not seen.
[12:45:46] t/05-scheduler-full.t .....................      Dubious, test returned 255 (wstat 65280, 0xff00)
All 3 subtests passed

which again looks to be a different problem. t/lib/OpenQA/Test/Utils.pm line 223. is sub _fail_and_exit { fail shift; done_testing; exit shift }

Locally I could not yet reproduce a problem trying with FULLSTACK=1 SCHEDULER_FULLSTACK=1 count_fail_ratio prove -l t/05-scheduler-full.t. Will try to go higher and with coverage collection overhead. Trying with make coverage FULLSTACK=1 SCHEDULER_FULLSTACK=1 RETRY=200 STABILITY_TEST=1 KEEP_DB=1 TESTS=t/05-scheduler-full.t.

This actually managed to reproduce the first, originally reported issue:

    not ok 6 - running job set to done if its worker re-connects claiming not to work on it anymore
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 220.
    #          got: 'running'
    #     expected: 'done'
    not ok 7 - running job incompleted if its worker re-connects claiming not to work on it anymore
    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 222.
    #          got: 'none'
    #     expected: 'incomplete'
    not ok 8 - reason is set
    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 224.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
[info] [pid:19210] Worker 5 websocket connection closed - 1006
    1..8
    # Looks like you failed 6 tests of 8.
not ok 3 - re-scheduling and incompletion of jobs when worker rejects jobs or goes offline

In another run I could reproduce the second error in rerun 12.

Actions #20

Updated by okurz over 3 years ago

  • Status changed from In Progress to Feedback

As these seem to be the same problems we already had I suggest we simply bump the numbers of retries: https://github.com/os-autoinst/openQA/pull/4085

Actions #21

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

I merged https://github.com/os-autoinst/openQA/pull/4085 . This will allow 5 instead of just 3 retries for the fullstack unstable tests. I also did https://github.com/os-autoinst/openQA/pull/4084 as further improvement. So we can close this ticket with the fix and the improvement present.

Actions

Also available in: Atom PDF