action #95024: openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #95024

closed

openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) size:M

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-07-02

Due date:

2021-07-22

% Done:

Estimated time:

Description

Observation¶

https://app.circleci.com/pipelines/github/os-autoinst/openQA/6870/workflows/d884128b-fac7-4852-91b2-baf739f648a4/jobs/64708?invite=true#step-108-99 shows

RETRY=5 timeout -s SIGINT -k 5 -v $((5 * (5 + 1) ))m tools/retry prove -l --harness TAP::Harness::JUnit --timer t/ui/26-jobs_restart.t
Retry 1 of 5 …
[19:34:35] t/ui/26-jobs_restart.t ..       All 10 subtests passed 
[19:35:36]

Test Summary Report
-------------------
t/ui/26-jobs_restart.t (Wstat: 14 Tests: 10 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=10, 60.7488 wallclock secs ( 0.36 usr  0.05 sys + 49.70 cusr  1.54 csys = 51.65 CPU)
Result: FAIL
Retry 2 of 5 …
[19:35:37] t/ui/26-jobs_restart.t ..       All 10 subtests passed 
[19:36:37]

Test Summary Report
-------------------
t/ui/26-jobs_restart.t (Wstat: 14 Tests: 10 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=10, 60.8001 wallclock secs ( 0.36 usr  0.03 sys + 49.43 cusr  1.59 csys = 51.41 CPU)
Result: FAIL
Retry 3 of 5 …
[19:36:38] t/ui/26-jobs_restart.t ..       All 10 subtests passed 
[19:37:39]

Test Summary Report
-------------------
t/ui/26-jobs_restart.t (Wstat: 14 Tests: 10 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=10, 60.7867 wallclock secs ( 0.36 usr  0.02 sys + 49.28 cusr  1.55 csys = 51.21 CPU)
Result: FAIL
Retry 4 of 5 …
[19:37:40] t/ui/26-jobs_restart.t ..       All 10 subtests passed 
[19:38:41]

Test Summary Report
-------------------
t/ui/26-jobs_restart.t (Wstat: 14 Tests: 10 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=10, 60.8044 wallclock secs ( 0.35 usr  0.03 sys + 49.73 cusr  1.48 csys = 51.59 CPU)
Result: FAIL
Retry 5 of 5 …
[19:38:42] t/ui/26-jobs_restart.t ..       All 10 subtests passed 
[19:39:42]

Test Summary Report
-------------------
t/ui/26-jobs_restart.t (Wstat: 14 Tests: 10 Failed: 0)
  Non-zero wait status: 14
Files=1, Tests=10, 60.8018 wallclock secs ( 0.38 usr  0.02 sys + 49.00 cusr  1.59 csys = 50.99 CPU)
Result: FAIL
make[2]: *** [Makefile:188: test-unit-and-integration] Error 1
make[2]: Leaving directory '/home/squamata/project'
make[1]: *** [Makefile:183: test-with-database] Error 2
make[1]: Leaving directory '/home/squamata/project'
make: *** [Makefile:168: test-unstable] Error 2

Exited with code exit status 2

Expected result¶

At least back to 1/3 failures, not more, better less than 0/100 failures

Suggestions¶

The tests report "All 10 subtests passed" but then fail with "Non-zero wait status: 14". We had similar cases in the past. This has likely something to do with the cleanup of background processes. Maybe we introduced a regression lately so that this background handling behaves different

I suggest to try to reproduce locally, bisect between "last good" and "first bad" to find the culprit
docs/Contributing.asciidoc explains that "Non-zero wait status: 14" just means that the test consistently times out now so something makes it super-slow or the test effectively never ends. This can be easily checked locally by just executing the test some times

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by okurz almost 4 years ago

Subject changed from openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) to openQA test t/ui/26-jobs_restart.t very unstable (already marked as unstable) size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by okurz almost 4 years ago

Related to action #95009: unstable/flaky test t/ui/12-needle-edit.t size:M added

Actions

Copy link

Updated by ilausuch almost 4 years ago

Status changed from Workable to In Progress
Assignee set to ilausuch

Actions

Copy link

Updated by ilausuch almost 4 years ago

I am having an other kind of error

Bailout called.  Further testing stopped:  clickElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/5.32.1/Test2/Hub/Subtest.pm line 67. at /home/openqa/openQA-dev/repos/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
Bail out!  clickElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/5.32.1/Test2/Hub/Subtest.pm line 67. at /home/openqa/openQA-dev/repos/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95

Actions

Copy link

Updated by okurz almost 4 years ago

I have seen the same in other test modules I think. Likely this is also a symptom of recent slowdown. Seems like something in 2021-06 made computers slow everywhere :( Be sure to be rebased on recent changes of timeout bumps by tinita.

Actions

Copy link

Updated by tinita almost 4 years ago

@ilausuch if this is what you get locally, try disabling the time limit. I have the feeling that this kind of error is related to our timeout settings.

Actions

Copy link

Updated by ilausuch almost 4 years ago

Yes is a timeout problem, if I add 10s extra it works. But I cannot get the described error locally

Actions

Copy link

Updated by ilausuch almost 4 years ago

I cannot reproduce this error locally, any ideas?

Actions

Copy link

Updated by okurz almost 4 years ago

well, have you read the suggestions? We know that the error in production means "timeout exceeded" and you just confirmed with the help of Tina that you also run into timeouts. As the suggestions state we should try to bisect to find out if any of our changes within our repo caused a slow-down, assuming you can identify a "last good". If you can't then we can accept the hypothesis of "other dependant factors that we can't control made everything slower" and we just bump the timeout in the test file.

Actions

Copy link

#10

Updated by tinita almost 4 years ago

It doesn't look like the error in CircleCI is a timeout or TimeLimit issue. Looking at the timestamps, the test doesn't run very long.
But it could of course be a different kind of timeout.
Something that happens at the end when Test::More has already reported that all tests were successful, but still the exit code is non-zero.

Maybe ssh into CircleCI to play around would be helpful. (As soon as CircleCI starts running actual tests again).

Actions

Copy link

#11

Updated by tinita almost 4 years ago

Oh wait, I didn't look at the TimeLimit value. It's 10. So yeah, bumping the timeout would be my suggestion.

Actions

Copy link

#12

Updated by okurz almost 4 years ago

tinita wrote:

It doesn't look like the error in CircleCI is a timeout or TimeLimit issue. Looking at the timestamps, the test doesn't run very long.

Have you read https://progress.opensuse.org/issues/95024#Suggestions ? We already found out that "Non-zero wait status: 14" effectively means that the program ends after handling SIGALRM, which means "test timeout" for us.

Actions

Copy link

#13

Updated by openqa_review almost 4 years ago

Due date set to 2021-07-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#14

Updated by okurz almost 4 years ago

Assignee changed from ilausuch to okurz
Priority changed from Urgent to High

As suggested by ilausuch I am taking over for him.

Seems this does not fail in CI anymore since recent timeout bumps. I am investigating locally how long the test takes now.

Without changes in current state locally count_fail_ratio prove -l t/ui/26-jobs_restart.t

## Run 1
t/ui/26-jobs_restart.t .. 4/? Bailout called.  Further testing stopped:  findElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: findElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 1. Fails: 1. Fail ratio 100%
## Run 2
t/ui/26-jobs_restart.t .. 5/? Bailout called.  Further testing stopped:  executeScript: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: executeScript: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 2. Fails: 2. Fail ratio 100%
## Run 3
t/ui/26-jobs_restart.t .. 4/? Bailout called.  Further testing stopped:  getElementText: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: getElementText: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 3. Fails: 3. Fail ratio 100%

Bumping timeout from 10s to 20s:

## Run 1
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 16 wallclock secs ( 0.11 usr  0.01 sys +  6.60 cusr  0.55 csys =  7.27 CPU)
Result: PASS
## count_fail_ratio: Run: 1. Fails: 0. Fail ratio 0%
## Run 2
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 17 wallclock secs ( 0.13 usr  0.00 sys +  6.63 cusr  0.61 csys =  7.37 CPU)
Result: PASS
## count_fail_ratio: Run: 2. Fails: 0. Fail ratio 0%
## Run 3
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 17 wallclock secs ( 0.12 usr  0.01 sys +  6.72 cusr  0.60 csys =  7.45 CPU)
Result: PASS
## count_fail_ratio: Run: 3. Fails: 0. Fail ratio 0%
## Run 4
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 16 wallclock secs ( 0.11 usr  0.02 sys +  6.73 cusr  0.51 csys =  7.37 CPU)
Result: PASS
## count_fail_ratio: Run: 4. Fails: 0. Fail ratio 0%
## Run 5
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 17 wallclock secs ( 0.12 usr  0.01 sys +  6.61 cusr  0.57 csys =  7.31 CPU)
Result: PASS
## count_fail_ratio: Run: 5. Fails: 0. Fail ratio 0%

Crosscheck on an older commit state of openQA, 80f3d69ae from 2021-06-11 reproduces the timeout problems so our codebase did not recently introduce such problems. Reproduced again on 627798bec from 2021-03-05 the last time when I was working on the same test file myself:

## Run 1
t/ui/26-jobs_restart.t .. 2/? Bailout called.  Further testing stopped:  getElementText: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
FAILED--Further testing stopped: getElementText: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
## count_fail_ratio: Run: 1. Fails: 1. Fail ratio 100%
## Run 2
t/ui/26-jobs_restart.t .. 6/? Bailout called.  Further testing stopped:  close: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
FAILED--Further testing stopped: close: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
## count_fail_ratio: Run: 2. Fails: 2. Fail ratio 100%
## Run 3
t/ui/26-jobs_restart.t .. 5/? Bailout called.  Further testing stopped:  switchToWindow: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
FAILED--Further testing stopped: switchToWindow: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
## count_fail_ratio: Run: 3. Fails: 3. Fail ratio 100%
## Run 4
t/ui/26-jobs_restart.t .. 6/? Bailout called.  Further testing stopped:  findElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
FAILED--Further testing stopped: findElement: Server returned error message Label not found for "last T2_SUBTEST_WRAPPER" at /usr/lib/perl5/vendor_perl/5.26.1/Test2/Hub/Subtest.pm line 67. at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:93
## count_fail_ratio: Run: 4. Fails: 4. Fail ratio 100%

so it's slower on CI and on other people's computers as well as mine (Leap 15.2, not quite up-to-date right now). So most likely hypothesis (not entirely serious) is that the climate change caused this.

https://github.com/os-autoinst/openQA/pull/4020

Still running a longer statistics evaluation in my local env with runs=100 count_fail_ratio prove -l t/ui/26-jobs_restart.t to decide if we can mark the test as stable again for CI. Testing for CI in https://github.com/os-autoinst/openQA/pull/4021

In some cases in between I got

## Run 36
t/ui/26-jobs_restart.t .. 2/? 
        #   Failed test 'job is marked as restarted'
        #   at t/ui/26-jobs_restart.t line 125.
        #          got: 'minimalx@32bit'
        #     expected: 'minimalx@32bit (restarted)'
Bailout called.  Further testing stopped:  findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 36. Fails: 1. Fail ratio 2%
## Run 37
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 19 wallclock secs ( 0.13 usr  0.01 sys +  7.82 cusr  0.77 csys =  8.73 CPU)
Result: PASS
## count_fail_ratio: Run: 37. Fails: 1. Fail ratio 2%
## Run 38
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 18 wallclock secs ( 0.13 usr  0.01 sys +  7.87 cusr  0.67 csys =  8.68 CPU)
Result: PASS
## count_fail_ratio: Run: 38. Fails: 1. Fail ratio 2%
## Run 39
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 20 wallclock secs ( 0.12 usr  0.02 sys +  8.20 cusr  0.72 csys =  9.06 CPU)
Result: PASS
## count_fail_ratio: Run: 39. Fails: 1. Fail ratio 2%
## Run 40
t/ui/26-jobs_restart.t .. ok    
All tests successful.
Files=1, Tests=10, 19 wallclock secs ( 0.12 usr  0.02 sys +  8.14 cusr  0.76 csys =  9.04 CPU)
Result: PASS
## count_fail_ratio: Run: 40. Fails: 1. Fail ratio 2%
## Run 41
t/ui/26-jobs_restart.t .. 3/? 
        #   Failed test 'job is marked as restarted'
        #   at t/ui/26-jobs_restart.t line 125.
        #          got: 'minimalx@32bit'
        #     expected: 'minimalx@32bit (restarted)'
Bailout called.  Further testing stopped:  findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 41. Fails: 2. Fail ratio 4%
## Run 42
t/ui/26-jobs_restart.t .. 2/? 
        #   Failed test 'job is marked as restarted'
        #   at t/ui/26-jobs_restart.t line 125.
        #          got: 'minimalx@32bit'
        #     expected: 'minimalx@32bit (restarted)'
Bailout called.  Further testing stopped:  findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
## count_fail_ratio: Run: 42. Fails: 3. Fail ratio 7%
## Run 43
t/ui/26-jobs_restart.t .. 2/? 
        #   Failed test 'job is marked as restarted'
        #   at t/ui/26-jobs_restart.t line 125.
        #          got: 'minimalx@32bit'
        #     expected: 'minimalx@32bit (restarted)'
Bailout called.  Further testing stopped:  findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95
FAILED--Further testing stopped: findChildElement: no such element: Unable to locate element: {"method":"xpath","selector":"./a[@title='new test']"} at /home/okurz/local/os-autoinst/openQA/t/ui/../lib/OpenQA/SeleniumTest.pm:95

Actions

Copy link

#15