action #59187

unstable/flaky test: ui/15-admin-workers.t

Added by okurz 5 months ago. Updated 5 months ago.

Status:ResolvedStart date:07/11/2019
Priority:NormalDue date:
Assignee:okurz% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

Observation

From https://circleci.com/gh/os-autoinst/openQA/4398

while :; do\
    [ $n -lt "3" ] || exit 1;\
    [ $n -eq 0 ] || echo Retrying...;\
    prove -l --harness TAP::Harness::JUnit t/ui/15-admin-workers.t && break;\
    n=$[$n+1];\
done
t/ui/15-admin-workers.t .. 6/? 
#   Failed test 'still on 99963'
#   at t/ui/15-admin-workers.t line 161.
#                   'All Tests
# Job Groups
# Logged in as Demo
# Worker localhost:1
# Host: localhost
# Instance: 1
# Alive: yes
# Websocket connection: Active
# Seen: about 2 hours ago
# Status: Working
# Properties
# Key Value
# JOBTOKEN token99963
# Previous jobs
# Show
# 10
# 25
# 50
# 100
# entries
# Test Result Finished
# opensuse-13.1-NET-x86_64-Build0091-kde@64bit not yet
# opensuse-Factory-staging_e-x86_64-Build87.5011-minimalx@32bit 0 about an hour ago
# Showing 1 to 2 of 2 entries
# Previous
# 1
# Next
# openQA is licensed GPL-2.0 - Version git-4.5.1528009330.e68ebe2b-79820978'
#     doesn't match '(?^:Worker status\nJob: 99963)'
t/ui/15-admin-workers.t .. 18/? # Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests 

Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
  Failed test:  10
  Non-zero exit status: 1
Files=1, Tests=19, 28.0708 wallclock secs ( 0.04 usr  0.01 sys + 16.63 cusr  0.79 csys = 17.47 CPU)
Result: FAIL
Retrying...
t/ui/15-admin-workers.t .. 1/? 
    #   Failed test 'working 99961'
    #   at t/ui/15-admin-workers.t line 127.
    #                   ''
    #     doesn't match '(?^:Worker status\nJob: 99961)'
    # Looks like you failed 1 test of 15.
t/ui/15-admin-workers.t .. 6/? 
#   Failed test 'worker overview'
#   at t/ui/15-admin-workers.t line 139.
t/ui/15-admin-workers.t .. 7/? # Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests 

Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
  Failed test:  6
  Non-zero exit status: 1
Files=1, Tests=19, 27.0958 wallclock secs ( 0.04 usr  0.00 sys + 16.08 cusr  0.66 csys = 16.78 CPU)
Result: FAIL
Retrying...
t/ui/15-admin-workers.t .. 10/? 
#   Failed test 'still on 99963'
#   at t/ui/15-admin-workers.t line 161.
#                   'All Tests
# Job Groups
# Logged in as Demo
# Worker localhost:1
# Host: localhost
# Instance: 1
# Alive: yes
# Websocket connection: Active
# Seen: about 2 hours ago
# Status: Working
# Properties
# Key Value
# JOBTOKEN token99963
# Previous jobs
# Show
# 10
# 25
# 50
# 100
# entries
# Test Result Finished
# opensuse-13.1-NET-x86_64-Build0091-kde@64bit not yet
# opensuse-Factory-staging_e-x86_64-Build87.5011-minimalx@32bit 0 about an hour ago
# Showing 1 to 2 of 2 entries
# Previous
# 1
# Next
# openQA is licensed GPL-2.0 - Version git-4.5.1528009330.e68ebe2b-79820978'
#     doesn't match '(?^:Worker status\nJob: 99963)'
# Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests 

Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
  Failed test:  10
  Non-zero exit status: 1

so I see three retries where in the first and third there is full output corresponding to the worker page but not with the expected string "Worker status…". In the second example the page actually does not seem to return any results. The text "Worker status" comes from https://github.com/os-autoinst/openQA/blob/master/templates/admin/workers/worker_status.html.ep#L12

History

#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I can see that andriinikitin is also working on this already with https://github.com/os-autoinst/openQA/pull/2382 . I am currently calling for i in {1..100}; do echo "TRY $i" && make PROVE_ARGS="-v t/ui/15-admin-workers.t" test-with-database || break ; done and could not reproduce the problem within 51 runs, aborted early. Maybe it fails when I slow it down with nice -n 19

EDIT: 100 times with nice -n 19 and also not failed once. So it only fails in circle CI somehow? Or in our container. At least a single run of make CHECKSTYLE=0 PROVE_ARGS="t/ui/15-admin-workers.t" UITEST=1 prepare-and-launch-docker-to-run-tests-within executes t/ui/15-admin-workers.t just fine

#2 Updated by andriinikitin 5 months ago

You can "retry with ssh" unstable job in circleci and try the same command there - I believe it will fail quickly.

#3 Updated by okurz 5 months ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)

what Andrii wrote could be tried. Did not yet do it myself.

#4 Updated by okurz 5 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

https://github.com/os-autoinst/openQA/pull/2485

EDIT: Does not seem to be enough.

So I restarted a failed circle CI test using SSH and in there did:

for i in {1..100}; do echo "TRY: $i" && make PROVE_ARGS="-v t/ui/15-admin-workers.t" test-with-database || break; done

which fails on the first or second retry.

With changes in three locations like:

diff --git a/t/ui/15-admin-workers.t b/t/ui/15-admin-workers.t
index 492a3acc5..9660dd67b 100644
--- a/t/ui/15-admin-workers.t
+++ b/t/ui/15-admin-workers.t
@@ -115,6 +115,8 @@ subtest 'worker overview' => sub {
     # check worker 1
     is($driver->find_element('tr#worker_1 .worker')->get_text(), 'localhost:1', 'localhost:1 shown');
     $driver->find_element('tr#worker_1 .help_popover')->click();
+    wait_for_element(selector => '.popover', description => 'worker status popover is displayed');
+    wait_for_ajax;
     like($driver->find_element('.popover')->get_text(), qr/Worker status\nJob: 99963/, 'on 99963');

     # close the popover and wait until it is actually gone (seems to have a very short animation)

this is more stable and I need both the wait_for_element and wait_for_ajax but still a failure after 18 runs:

TRY: 18
test -d /dev/shm/tpg && (pg_ctl -D /dev/shm/tpg -s status >&/dev/null || pg_ctl -D /dev/shm/tpg -s start) || ./t/test_postgresql /dev/shm/tpg
…
export GLOBIGNORE=""; prove -l -v t/ui/15-admin-workers.t
t/ui/15-admin-workers.t .. 
…
ok 8 - on worker 1
[debug] [eeb0ec09] Routing to controller "OpenQA::WebAPI::Controller::Admin::Workers" and action "previous_jobs_ajax"
ok 9 - table properties shown
[debug] [eeb0ec09] 200 OK (0.011117s, 89.952/s)
ok 10 - worker status popover is displayed
ok 11 - Wait for jQuery successful
not ok 12 - still on 99963

#   Failed test 'still on 99963'
#   at t/ui/15-admin-workers.t line 165.
#                   ''
#     doesn't match '(?^:Worker status\nJob: 99963)'

only after removing the third part about "still on 99963" which shouldn't actually be needed I could make the test stable.

-> https://github.com/os-autoinst/openQA/pull/2485

#5 Updated by okurz 5 months ago

The PR or actually the commit is in master. Not sure if I did this by mistake or someone else though.

#6 Updated by okurz 5 months ago

remove from list of unstable tests with https://github.com/os-autoinst/openQA/pull/2494

#7 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

Also available in: Atom PDF