action #59187
closedunstable/flaky test: ui/15-admin-workers.t
Description
Observation¶
From https://circleci.com/gh/os-autoinst/openQA/4398
while :; do\
[ $n -lt "3" ] || exit 1;\
[ $n -eq 0 ] || echo Retrying...;\
prove -l --harness TAP::Harness::JUnit t/ui/15-admin-workers.t && break;\
n=$[$n+1];\
done
t/ui/15-admin-workers.t .. 6/?
# Failed test 'still on 99963'
# at t/ui/15-admin-workers.t line 161.
# 'All Tests
# Job Groups
# Logged in as Demo
# Worker localhost:1
# Host: localhost
# Instance: 1
# Alive: yes
# Websocket connection: Active
# Seen: about 2 hours ago
# Status: Working
# Properties
# Key Value
# JOBTOKEN token99963
# Previous jobs
# Show
# 10
# 25
# 50
# 100
# entries
# Test Result Finished
# opensuse-13.1-NET-x86_64-Build0091-kde@64bit not yet
# opensuse-Factory-staging_e-x86_64-Build87.5011-minimalx@32bit 0 about an hour ago
# Showing 1 to 2 of 2 entries
# Previous
# 1
# Next
# openQA is licensed GPL-2.0 - Version git-4.5.1528009330.e68ebe2b-79820978'
# doesn't match '(?^:Worker status\nJob: 99963)'
t/ui/15-admin-workers.t .. 18/? # Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests
Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
Failed test: 10
Non-zero exit status: 1
Files=1, Tests=19, 28.0708 wallclock secs ( 0.04 usr 0.01 sys + 16.63 cusr 0.79 csys = 17.47 CPU)
Result: FAIL
Retrying...
t/ui/15-admin-workers.t .. 1/?
# Failed test 'working 99961'
# at t/ui/15-admin-workers.t line 127.
# ''
# doesn't match '(?^:Worker status\nJob: 99961)'
# Looks like you failed 1 test of 15.
t/ui/15-admin-workers.t .. 6/?
# Failed test 'worker overview'
# at t/ui/15-admin-workers.t line 139.
t/ui/15-admin-workers.t .. 7/? # Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests
Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
Failed test: 6
Non-zero exit status: 1
Files=1, Tests=19, 27.0958 wallclock secs ( 0.04 usr 0.00 sys + 16.08 cusr 0.66 csys = 16.78 CPU)
Result: FAIL
Retrying...
t/ui/15-admin-workers.t .. 10/?
# Failed test 'still on 99963'
# at t/ui/15-admin-workers.t line 161.
# 'All Tests
# Job Groups
# Logged in as Demo
# Worker localhost:1
# Host: localhost
# Instance: 1
# Alive: yes
# Websocket connection: Active
# Seen: about 2 hours ago
# Status: Working
# Properties
# Key Value
# JOBTOKEN token99963
# Previous jobs
# Show
# 10
# 25
# 50
# 100
# entries
# Test Result Finished
# opensuse-13.1-NET-x86_64-Build0091-kde@64bit not yet
# opensuse-Factory-staging_e-x86_64-Build87.5011-minimalx@32bit 0 about an hour ago
# Showing 1 to 2 of 2 entries
# Previous
# 1
# Next
# openQA is licensed GPL-2.0 - Version git-4.5.1528009330.e68ebe2b-79820978'
# doesn't match '(?^:Worker status\nJob: 99963)'
# Looks like you failed 1 test of 19.
t/ui/15-admin-workers.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/19 subtests
Test Summary Report
-------------------
t/ui/15-admin-workers.t (Wstat: 256 Tests: 19 Failed: 1)
Failed test: 10
Non-zero exit status: 1
so I see three retries where in the first and third there is full output corresponding to the worker page but not with the expected string "Worker status…". In the second example the page actually does not seem to return any results. The text "Worker status" comes from https://github.com/os-autoinst/openQA/blob/master/templates/admin/workers/worker_status.html.ep#L12
Updated by okurz about 5 years ago
- Status changed from New to In Progress
- Assignee set to okurz
I can see that andriinikitin is also working on this already with https://github.com/os-autoinst/openQA/pull/2382 . I am currently calling for i in {1..100}; do echo "TRY $i" && make PROVE_ARGS="-v t/ui/15-admin-workers.t" test-with-database || break ; done
and could not reproduce the problem within 51 runs, aborted early. Maybe it fails when I slow it down with nice -n 19
EDIT: 100 times with nice -n 19
and also not failed once. So it only fails in circle CI somehow? Or in our container. At least a single run of make CHECKSTYLE=0 PROVE_ARGS="t/ui/15-admin-workers.t" UITEST=1 prepare-and-launch-docker-to-run-tests-within
executes t/ui/15-admin-workers.t just fine
Updated by andriinikitin about 5 years ago
You can "retry with ssh" unstable job in circleci and try the same command there - I believe it will fail quickly.
Updated by okurz about 5 years ago
- Status changed from In Progress to New
- Assignee deleted (
okurz)
what Andrii wrote could be tried. Did not yet do it myself.
Updated by okurz about 5 years ago
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Current Sprint
https://github.com/os-autoinst/openQA/pull/2485
EDIT: Does not seem to be enough.
So I restarted a failed circle CI test using SSH and in there did:
for i in {1..100}; do echo "TRY: $i" && make PROVE_ARGS="-v t/ui/15-admin-workers.t" test-with-database || break; done
which fails on the first or second retry.
With changes in three locations like:
diff --git a/t/ui/15-admin-workers.t b/t/ui/15-admin-workers.t
index 492a3acc5..9660dd67b 100644
--- a/t/ui/15-admin-workers.t
+++ b/t/ui/15-admin-workers.t
@@ -115,6 +115,8 @@ subtest 'worker overview' => sub {
# check worker 1
is($driver->find_element('tr#worker_1 .worker')->get_text(), 'localhost:1', 'localhost:1 shown');
$driver->find_element('tr#worker_1 .help_popover')->click();
+ wait_for_element(selector => '.popover', description => 'worker status popover is displayed');
+ wait_for_ajax;
like($driver->find_element('.popover')->get_text(), qr/Worker status\nJob: 99963/, 'on 99963');
# close the popover and wait until it is actually gone (seems to have a very short animation)
this is more stable and I need both the wait_for_element
and wait_for_ajax
but still a failure after 18 runs:
TRY: 18
test -d /dev/shm/tpg && (pg_ctl -D /dev/shm/tpg -s status >&/dev/null || pg_ctl -D /dev/shm/tpg -s start) || ./t/test_postgresql /dev/shm/tpg
…
export GLOBIGNORE=""; prove -l -v t/ui/15-admin-workers.t
t/ui/15-admin-workers.t ..
…
ok 8 - on worker 1
[debug] [eeb0ec09] Routing to controller "OpenQA::WebAPI::Controller::Admin::Workers" and action "previous_jobs_ajax"
ok 9 - table properties shown
[debug] [eeb0ec09] 200 OK (0.011117s, 89.952/s)
ok 10 - worker status popover is displayed
ok 11 - Wait for jQuery successful
not ok 12 - still on 99963
# Failed test 'still on 99963'
# at t/ui/15-admin-workers.t line 165.
# ''
# doesn't match '(?^:Worker status\nJob: 99963)'
only after removing the third part about "still on 99963" which shouldn't actually be needed I could make the test stable.
Updated by okurz about 5 years ago
The PR or actually the commit is in master. Not sure if I did this by mistake or someone else though.
Updated by okurz about 5 years ago
remove from list of unstable tests with https://github.com/os-autoinst/openQA/pull/2494
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done
Updated by tinita about 3 years ago
I saw this failure just now:
https://app.circleci.com/pipelines/github/os-autoinst/openQA/8595/workflows/53935253-041c-428a-9855-069db2ee84dd/jobs/81201/steps
[16:35:41] t/ui/15-admin-workers.t .................... 5/?
# Failed test 'on 99963'
# at t/ui/15-admin-workers.t line 119.
# ''
# doesn't match '(?^:Worker status\nJob: 99963)'
# Looks like you failed 1 test of 19.
[16:35:41] t/ui/15-admin-workers.t .................... 7/?
# Failed test 'worker overview'
# at t/ui/15-admin-workers.t line 142.
[16:35:41] t/ui/15-admin-workers.t .................... 8/?
[16:35:41] t/ui/15-admin-workers.t .................... 15/?
[16:35:41] t/ui/15-admin-workers.t .................... 17/?
# Looks like you failed 1 test of 17.
[16:35:41] t/ui/15-admin-workers.t .................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/17 subtests