action #95848: [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #95848

closed

[sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M

Added by tinita almost 4 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-07-22

Due date:

% Done:

Estimated time:

Description

Problem¶

t/05-scheduler-full.t fails more often than it used too (it is in "unstable" already)

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7142/workflows/7a75593c-a719-4da8-934e-4c4433f67dc2/jobs/67405/steps

[09:54:43] t/05-scheduler-full.t ..................... 2/? 
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 220.
    #          got: 'running'
    #     expected: 'done'

    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 222.
    #          got: 'none'
    #     expected: 'incomplete'

    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 224.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
    # Looks like you failed 3 tests of 12.
[09:54:43] t/05-scheduler-full.t ..................... 3/? 
#   Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
#   at t/05-scheduler-full.t line 228.
[09:54:43] t/05-scheduler-full.t ..................... 4/? [09:54:43] t/05-scheduler-full.t ..................... 5/? # Looks like you failed 1 test of 5.
                                                            [09:54:43] t/05-scheduler-full.t ..................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7147/workflows/fe33092f-ebf0-4224-a1a9-234babeff802/jobs/67468/steps

[14:04:04] t/05-scheduler-full.t ..................... 1/? 
    #   Failed test 'no more jobs need to be allocated'
    #   at t/05-scheduler-full.t line 133.
    #          got: '1'
    #     expected: '0'
    # Looks like you failed 1 test of 4.
[14:04:04] t/05-scheduler-full.t ..................... 2/? 
#   Failed test 'Scheduler worker job allocation'
#   at t/05-scheduler-full.t line 137.

Acceptance Criteria¶

AC 1: t/05-scheduler-full.t is stable for 10 runs after apply the solution

Suggestions¶

Just an observation: It became more unstable when many of our tests on CircleCI became slower and unstable

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by tinita almost 4 years ago

Copied from action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M added

Actions

Copy link

Updated by tinita almost 4 years ago

Copied to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added

Actions

Copy link

Updated by ilausuch almost 4 years ago

Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky to CircleCI: openQA t/05-scheduler-full.t flaky size:M
Description updated (diff)
Status changed from New to Workable
Priority changed from Normal to Urgent

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky size:M to [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M

Actions

Copy link

Updated by tinita almost 4 years ago

Copied to deleted (action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M)

Actions

Copy link

Updated by tinita almost 4 years ago

Related to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added

Actions

Copy link

Updated by tinita almost 4 years ago

Copied from deleted (action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M)

Actions

Copy link

Updated by tinita over 3 years ago

Description updated (diff)

Actions

Copy link

Updated by ilausuch over 3 years ago

Checking on the error lines I found that

For the line 220 we have an active waiting

    for (0 .. 100) {
        last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
        sleep .2;
    }

I created this PR https://github.com/os-autoinst/openQA/pull/4074 that increase the timeout 10 seconds for this case.
This waits 20 seconds. Maybe on the cases that there is more load that usual then we don't wait enough time.

Actions

Copy link

#10

Updated by ilausuch over 3 years ago

Status changed from Workable to In Progress
Assignee set to ilausuch

Actions

Copy link

#11

Updated by ilausuch over 3 years ago

The problem here is how to ensure that the tests we run can prove that this works when the load is high

Actions

Copy link

#12

Updated by ilausuch over 3 years ago

About the second problem on line 133. I am not sure how it could happens tanking in consideration the previous checks. Unless the ->schedule() method returns too early a result and we have to wait.

Actions

Copy link

#13

Updated by tinita over 3 years ago

to me the failure in line 133 looks like the same problem, it calls wait_for_worker a few lines above.

Actions

Copy link

#14

Updated by openqa_review over 3 years ago

Due date set to 2021-08-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#15

Updated by ilausuch over 3 years ago

Due date deleted (~~2021-08-10~~)
Status changed from In Progress to Workable

I leave this ticket. No more ideas

Actions

Copy link

#16

Updated by ilausuch over 3 years ago

Assignee deleted (~~ilausuch~~)

Actions

Copy link

#17

Updated by okurz over 3 years ago

Status changed from Workable to In Progress
Assignee set to okurz

Actions

Copy link

#18

Updated by okurz over 3 years ago

checking stability in https://github.com/os-autoinst/openQA/pull/4085

Actions

Copy link

#19

Updated by okurz over 3 years ago

tests failed in 1/4 runs

…Rerun 4 of 100 …
[12:45:46] t/05-scheduler-full.t ..................... 3/? 
    #   Failed test 'sub process openqa-webapi terminated with exit code 9'
    #   at /home/squamata/project/t/lib/OpenQA/Test/Utils.pm line 223.
A context appears to have been destroyed without first calling release().
Based on $@ it does not look like an exception was thrown (this is not always
a reliable test)

This is a problem because the global error variables ($!, $@, and $?) will
not be restored. In addition some release callbacks will not work properly from
inside a DESTROY method.

Here are the context creation details, just in case a tool forgot to call
release():
  File: t/05-scheduler-full.t
  Line: 277
  Tool: Test::More::subtest

Cleaning up the CONTEXT stack...
# Test ended with extra hubs on the stack!
    # Tests were run but no plan was declared and done_testing() was not seen.
[12:45:46] t/05-scheduler-full.t .....................      Dubious, test returned 255 (wstat 65280, 0xff00)
All 3 subtests passed

which again looks to be a different problem. t/lib/OpenQA/Test/Utils.pm line 223. is sub _fail_and_exit { fail shift; done_testing; exit shift }

Locally I could not yet reproduce a problem trying with FULLSTACK=1 SCHEDULER_FULLSTACK=1 count_fail_ratio prove -l t/05-scheduler-full.t. Will try to go higher and with coverage collection overhead. Trying with make coverage FULLSTACK=1 SCHEDULER_FULLSTACK=1 RETRY=200 STABILITY_TEST=1 KEEP_DB=1 TESTS=t/05-scheduler-full.t.

This actually managed to reproduce the first, originally reported issue:

    not ok 6 - running job set to done if its worker re-connects claiming not to work on it anymore
    #   Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 220.
    #          got: 'running'
    #     expected: 'done'
    not ok 7 - running job incompleted if its worker re-connects claiming not to work on it anymore
    #   Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
    #   at t/05-scheduler-full.t line 222.
    #          got: 'none'
    #     expected: 'incomplete'
    not ok 8 - reason is set
    #   Failed test 'reason is set'
    #   at t/05-scheduler-full.t line 224.
    #                   undef
    #     doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
[info] [pid:19210] Worker 5 websocket connection closed - 1006
    1..8
    # Looks like you failed 6 tests of 8.
not ok 3 - re-scheduling and incompletion of jobs when worker rejects jobs or goes offline

In another run I could reproduce the second error in rerun 12.

Actions

Copy link

#20

Updated by okurz over 3 years ago

Status changed from In Progress to Feedback

As these seem to be the same problems we already had I suggest we simply bump the numbers of retries: https://github.com/os-autoinst/openQA/pull/4085

Actions

Copy link

#21

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

I merged https://github.com/os-autoinst/openQA/pull/4085 . This will allow 5 instead of just 3 retries for the fullstack unstable tests. I also did https://github.com/os-autoinst/openQA/pull/4084 as further improvement. So we can close this ticket with the fix and the improvement present.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #95848

[sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M

Problem¶

Acceptance Criteria¶

Suggestions¶

Updated by tinita almost 4 years ago

Updated by tinita almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by okurz almost 4 years ago

Updated by tinita almost 4 years ago

Updated by tinita almost 4 years ago

Updated by tinita almost 4 years ago

Updated by tinita over 3 years ago

Updated by ilausuch over 3 years ago

Updated by ilausuch over 3 years ago

Updated by ilausuch over 3 years ago

Updated by ilausuch over 3 years ago

Updated by tinita over 3 years ago

Updated by openqa_review over 3 years ago

Updated by ilausuch over 3 years ago

Updated by ilausuch over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago