action #95848
[sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M (closed)
Description
Problem
t/05-scheduler-full.t fails more often than it used to (it is in "unstable" already)
[09:54:43] t/05-scheduler-full.t ..................... 2/?
# Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 220.
# got: 'running'
# expected: 'done'
# Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 222.
# got: 'none'
# expected: 'incomplete'
# Failed test 'reason is set'
# at t/05-scheduler-full.t line 224.
# undef
# doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
# Looks like you failed 3 tests of 12.
[09:54:43] t/05-scheduler-full.t ..................... 3/?
# Failed test 're-scheduling and incompletion of jobs when worker rejects jobs or goes offline'
# at t/05-scheduler-full.t line 228.
[09:54:43] t/05-scheduler-full.t ..................... 4/? [09:54:43] t/05-scheduler-full.t ..................... 5/? # Looks like you failed 1 test of 5.
[09:54:43] t/05-scheduler-full.t ..................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/5 subtests
[14:04:04] t/05-scheduler-full.t ..................... 1/?
# Failed test 'no more jobs need to be allocated'
# at t/05-scheduler-full.t line 133.
# got: '1'
# expected: '0'
# Looks like you failed 1 test of 4.
[14:04:04] t/05-scheduler-full.t ..................... 2/?
# Failed test 'Scheduler worker job allocation'
# at t/05-scheduler-full.t line 137.
Acceptance Criteria
AC 1: t/05-scheduler-full.t is stable for 10 runs after applying the solution (see the sketch below)
Suggestions
- Just an observation: It became more unstable when many of our tests on CircleCI became slower and unstable
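One way to check the "stable for 10 runs" acceptance criterion locally could be a small loop like the following sketch (illustrative only, not an existing project helper; it assumes the fullstack environment flags used later in this ticket):
# Illustrative sketch: run the test file ten times and count failing runs;
# AC 1 is met when zero runs fail.
use strict;
use warnings;

$ENV{FULLSTACK} = 1;    # fullstack flags as used elsewhere in this ticket
$ENV{SCHEDULER_FULLSTACK} = 1;

my $fails = 0;
for my $run (1 .. 10) {
    # prove exits non-zero when the test file fails
    $fails++ if system('prove', '-l', 't/05-scheduler-full.t') != 0;
}
print "$fails of 10 runs failed\n";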
Updated by tinita over 3 years ago
- Copied from action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M added
Updated by tinita over 3 years ago
- Copied to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added
Updated by ilausuch over 3 years ago
- Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky to CircleCI: openQA t/05-scheduler-full.t flaky size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
Updated by okurz over 3 years ago
- Subject changed from CircleCI: openQA t/05-scheduler-full.t flaky size:M to [sporadic] CircleCI: openQA t/05-scheduler-full.t flaky size:M
Updated by tinita over 3 years ago
- Copied to deleted (action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M)
Updated by tinita over 3 years ago
- Related to action #95851: [sporadic] CircleCI: openQA t/43-scheduling-and-worker-scalability.t flaky size:M added
Updated by tinita over 3 years ago
- Copied from deleted (action #95839: [sporadic] CircleCI: openQA t/ui/14-dashboard.t flaky size:M)
Updated by ilausuch over 3 years ago
Checking the error lines I found the following:
For line 220 we have an active wait:
for (0 .. 100) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}
I created this PR https://github.com/os-autoinst/openQA/pull/4074 that increases the timeout by 10 seconds for this case.
The existing loop waits 20 seconds. Maybe in cases where there is more load than usual we simply don't wait long enough.
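A minimal sketch of what such an extended wait could look like (the iteration count here is illustrative; the exact change is in the linked PR):
# Illustrative: extend the active wait from ~20 to ~30 seconds; assumes
# Time::HiRes provides the fractional sleep used in the loop above.
use Time::HiRes qw(sleep);

for (0 .. 150) {
    last if $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE;
    sleep .2;
}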
Updated by ilausuch over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to ilausuch
Updated by ilausuch over 3 years ago
The problem here is how to ensure that the tests we run can prove that this works when the load is high.
Updated by ilausuch over 3 years ago
About the second problem on line 133: I am not sure how it could happen, taking the previous checks into consideration, unless the ->schedule() method returns a result too early and we have to wait.
Updated by tinita over 3 years ago
To me the failure in line 133 looks like the same problem; it calls wait_for_worker a few lines above.
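A generic polling helper along these lines could make such call sites more tolerant of load (the helper name and signature are illustrative, not actual openQA test code):
# Illustrative helper (not from the openQA codebase): poll a condition
# instead of asserting right after ->schedule(), to tolerate a slow tick.
use Time::HiRes qw(sleep);

sub wait_until {
    my ($check, $timeout) = @_;
    my $elapsed = 0;
    while ($elapsed < $timeout) {
        return 1 if $check->();
        sleep 0.2;
        $elapsed += 0.2;
    }
    return 0;
}

# e.g.: wait_until(sub { $jobs->find(99982)->state eq OpenQA::Jobs::Constants::DONE }, 30)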
Updated by openqa_review over 3 years ago
- Due date set to 2021-08-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ilausuch over 3 years ago
- Due date deleted (2021-08-10)
- Status changed from In Progress to Workable
I am leaving this ticket. No more ideas.
Updated by okurz over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz over 3 years ago
checking stability in https://github.com/os-autoinst/openQA/pull/4085
Updated by okurz over 3 years ago
tests failed in 1/4 runs
…Rerun 4 of 100 …
[12:45:46] t/05-scheduler-full.t ..................... 3/?
# Failed test 'sub process openqa-webapi terminated with exit code 9'
# at /home/squamata/project/t/lib/OpenQA/Test/Utils.pm line 223.
A context appears to have been destroyed without first calling release().
Based on $@ it does not look like an exception was thrown (this is not always
a reliable test)
This is a problem because the global error variables ($!, $@, and $?) will
not be restored. In addition some release callbacks will not work properly from
inside a DESTROY method.
Here are the context creation details, just in case a tool forgot to call
release():
File: t/05-scheduler-full.t
Line: 277
Tool: Test::More::subtest
Cleaning up the CONTEXT stack...
# Test ended with extra hubs on the stack!
# Tests were run but no plan was declared and done_testing() was not seen.
[12:45:46] t/05-scheduler-full.t ..................... Dubious, test returned 255 (wstat 65280, 0xff00)
All 3 subtests passed
This again looks to be a different problem. t/lib/OpenQA/Test/Utils.pm line 223 is: sub _fail_and_exit { fail shift; done_testing; exit shift }
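For readability, that one-liner is roughly equivalent to the following (assuming Test::More's fail and done_testing are in scope, as they are in the test utilities):
sub _fail_and_exit {
    my ($message, $exit_code) = @_;
    fail $message;     # record one failing test with the given description
    done_testing;      # close the test plan so the harness sees a clean end
    exit $exit_code;   # terminate the sub process with the given exit code
}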
Locally I could not yet reproduce a problem trying with FULLSTACK=1 SCHEDULER_FULLSTACK=1 count_fail_ratio prove -l t/05-scheduler-full.t. Will try to go higher and with coverage collection overhead. Trying with make coverage FULLSTACK=1 SCHEDULER_FULLSTACK=1 RETRY=200 STABILITY_TEST=1 KEEP_DB=1 TESTS=t/05-scheduler-full.t.
This actually managed to reproduce the first, originally reported issue:
not ok 6 - running job set to done if its worker re-connects claiming not to work on it anymore
# Failed test 'running job set to done if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 220.
# got: 'running'
# expected: 'done'
not ok 7 - running job incompleted if its worker re-connects claiming not to work on it anymore
# Failed test 'running job incompleted if its worker re-connects claiming not to work on it anymore'
# at t/05-scheduler-full.t line 222.
# got: 'none'
# expected: 'incomplete'
not ok 8 - reason is set
# Failed test 'reason is set'
# at t/05-scheduler-full.t line 224.
# undef
# doesn't match '(?^:abandoned: associated worker .+:\d+ re-connected but abandoned the job)'
[info] [pid:19210] Worker 5 websocket connection closed - 1006
1..8
# Looks like you failed 6 tests of 8.
not ok 3 - re-scheduling and incompletion of jobs when worker rejects jobs or goes offline
In another run I could reproduce the second error in rerun 12.
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
As these seem to be the same problems we already had, I suggest we simply bump the number of retries: https://github.com/os-autoinst/openQA/pull/4085
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
I merged https://github.com/os-autoinst/openQA/pull/4085. This allows 5 instead of just 3 retries for the unstable fullstack tests. I also did https://github.com/os-autoinst/openQA/pull/4084 as a further improvement. So we can close this ticket with the fix and the improvement present.
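Roughly sketched, the retry behaviour amounts to the following (illustrative only; the actual configuration change is in the linked PR):
# Illustrative only (not the actual PR change): re-run a flaky test file up
# to $retries times and stop as soon as one run passes.
my $retries = 5;
for my $attempt (1 .. $retries) {
    last if system('prove', '-l', 't/05-scheduler-full.t') == 0;
    die "still failing after $retries attempts\n" if $attempt == $retries;
}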