Project

General

Profile

coordination #98952

[epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors

Added by okurz 2 months ago. Updated 6 days ago.

Status:
Blocked
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-09-21
Due date:
2021-12-11
% Done:

50%

Estimated time:
(Total: 0.00 h)
Difficulty:


Subtasks

action #102578: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:MIn Progressokurz

coordination #102581: Proof of concept of t/full-stack.t on GitHubActionsResolvedcdywan


Related issues

Related to openQA Project - action #101734: Prevent "called on undefined file descriptor" in myjsonrpc.pm:40 in openQA t/full-stack.tResolved2021-10-29

History

#1 Updated by okurz 2 months ago

  • Description updated (diff)

#2 Updated by okurz 2 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I consider the ticket "Workable" already as it was in before. Taking the ticket and setting to "In Progress" relating to the above PR to address urgency.

#3 Updated by okurz 2 months ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Urgent to High

https://github.com/os-autoinst/openQA/pull/4218 merged, back to backlog with lower prio to fix.

#4 Updated by okurz 2 months ago

  • Priority changed from High to Urgent

#5 Updated by cdywan 2 months ago

  • Subject changed from t/full-stack.t can timeout, "clickElement: element not interactable" to t/full-stack.t can timeout, "clickElement: element not interactable" size:M
  • Status changed from New to Workable

#6 Updated by okurz 2 months ago

https://app.circleci.com/pipelines/github/os-autoinst/openQA/7796/workflows/05bae277-19b8-450b-92d9-4f21eccb5ff4/jobs/73602/steps shows a failed overall test run after multiple consecutive failures:

[10:25:05] t/full-stack.t .. 19/?     # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/8e216fcf83/test_result.js 287 WebSocket connection to 'ws://localhost:9528/liveviewhandler/tests/1/developer/ws-proxy/status' failed: Close received after close",
    #     source    => "network",
    #     timestamp => 1632392734180,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392734653,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392734835,
    #   },
    # ]
    # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392735967,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392735967,
    #   },
    # ]
[10:25:05] t/full-stack.t .. 22/?     # javascript console output, waiting for paused: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392741038,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/usXVXLzXROijBH3t/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1632392741038,
    #   },
    # ]
[10:25:05] t/full-stack.t .. 25/? # javascript console output: [
#   {
#     level     => "SEVERE",
#     message   => "http://localhost:9526/asset/8e216fcf83/test_result.js 287 WebSocket connection to 'ws://localhost:9528/liveviewhandler/tests/1/developer/ws-proxy/status' failed: Close received after close",
#     source    => "network",
#     timestamp => 1632392742964,
#   },
# ]
[10:25:05] t/full-stack.t .. 33/? # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed'
# full result panel contents:
# 0 1
# Result: failed finished less than a minute ago ( 00:10 minutes )
# Reason: isotovideo done: Can't use an undefined value as a symbol reference at /home/squamata/os-autoinst/myjsonrpc.pm line 54.
# Scheduled product: job has not been created by posting an ISO
# Assigned worker: 03486c3d351a:1

Locally could not reproduce a failure:

$ make test KEEP_DB=1 TESTS=t/full-stack.t FULLSTACK=1
test -d /dev/shm/tpg && (pg_ctl -D /dev/shm/tpg -s status >&/dev/null || pg_ctl -D /dev/shm/tpg -s start) || ./t/test_postgresql /dev/shm/tpg
make test-unit-and-integration TEST_PG="DBI:Pg:dbname=openqa_test;host=/dev/shm/tpg"
make[1]: Entering directory '/home/okurz/local/os-autoinst/openQA'
export GLOBIGNORE="";\
export DEVEL_COVER_DB_FORMAT=JSON;\
export PERL5OPT=" -It/lib -I/home/okurz/local/os-autoinst/openQA/t/lib -MOpenQA::Test::PatchDeparse";\
RETRY=0 timeout -s SIGINT -k 5 -v $((60 * 1 * (0 + 1) ))m tools/retry prove -l --trap  t/full-stack.t
t/full-stack.t .. ok     
All tests successful.
Files=1, Tests=111, 176 wallclock secs ( 0.54 usr  0.02 sys + 51.49 cusr 11.38 csys = 63.43 CPU)
Result: PASS
make[1]: Leaving directory '/home/okurz/local/os-autoinst/openQA'
[ 1 = 1 ] || pg_ctl -D /dev/shm/tpg stop

#7 Updated by okurz 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Gathering statistics

#8 Updated by okurz 2 months ago

Running count_fail_ratio make test KEEP_DB=1 TESTS=t/full-stack.t FULLSTACK=1 with count_fail_ratio locally I have:

## count_fail_ratio: Run: 20. Fails: 0. Fail ratio 0%

with a mean runtime of t/full-stack.t of roughly 150s. Maybe reproducible with coverage analysis calling count_fail_ratio make coverage KEEP_DB=1 TESTS=t/full-stack.t FULLSTACK=1:

## Run 4
export DEVEL_COVER_DB_FORMAT=JSON;\
COVERAGE=1 cover  -test
Deleting database /home/okurz/local/os-autoinst/openQA/cover_db
cover: running make test "OPTIMIZE=-O0 -fprofile-arcs -ftest-coverage" "OTHERLDFLAGS=-fprofile-arcs -ftest-coverage"
make[1]: Entering directory '/home/okurz/local/os-autoinst/openQA'
test -d /dev/shm/tpg && (pg_ctl -D /dev/shm/tpg -s status >&/dev/null || pg_ctl -D /dev/shm/tpg -s start) || ./t/test_postgresql /dev/shm/tpg
make test-unit-and-integration TEST_PG="DBI:Pg:dbname=openqa_test;host=/dev/shm/tpg"
make[2]: Entering directory '/home/okurz/local/os-autoinst/openQA'
export GLOBIGNORE="";\
export DEVEL_COVER_DB_FORMAT=JSON;\
export PERL5OPT="-mJSON::PP -MDevel::Cover=-select_re,'^/lib',+ignore_re,lib/perlcritic/Perl/Critic/Policy,-coverage,statement,-db,cover_db, -It/lib -I/home/okurz/local/os-autoinst/openQA/t/lib -MOpenQA::Test::PatchDeparse";\
RETRY=0 timeout -s SIGINT -k 5 -v $((60 * 1 * (0 + 1) ))m tools/retry prove -l --trap  t/full-stack.t
t/full-stack.t .. Failed 1/94 subtests 

Test Summary Report
-------------------
t/full-stack.t (Wstat: 0 Tests: 93 Failed: 0)
  Parse errors: Tests out of sequence.  Found (19) but expected (18)
                Tests out of sequence.  Found (20) but expected (19)
                Tests out of sequence.  Found (21) but expected (20)
                Tests out of sequence.  Found (22) but expected (21)
                Tests out of sequence.  Found (23) but expected (22)
Displayed the first 5 of 77 TAP syntax errors.
Re-run prove with the -p option to see them all.
Files=1, Tests=93, 303 wallclock secs ( 1.01 usr  0.03 sys + 199.11 cusr 15.47 csys = 215.62 CPU)
Result: FAIL

well, at least some failure. I don't see what step actually failed, strange. All other tests in this run passed so the overall statistics locally from the above command:

## count_fail_ratio: Run: 20. Fails: 1. Fail ratio 5%

Maybe this is about KVM? AFAIR in circleCI there is no KVM support so the full stack test sets the os-autoinst test variable QEMU_NO_KVM based on the line t/lib/OpenQA/Test/FullstackUtils.pm:$JOB_SETUP->{QEMU_NO_KVM} = '1' unless -r '/dev/kvm';

Running with QEMU_NO_KVM=1 count_fail_ratio make coverage KEEP_DB=1 TESTS=t/full-stack.t FULLSTACK=1

This was also fine with ## count_fail_ratio: Run: 7. Fails: 0. Fail ratio 0%

#9 Updated by openqa_review 2 months ago

  • Due date set to 2021-10-08

Setting due date based on mean cycle time of SUSE QE Tools

#10 Updated by okurz 2 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

https://github.com/os-autoinst/openQA/pull/4233 clearly reproduces the problem in CI. The problem seems to be about recent software changes. Also the "dependency PR" in the past days as visible in https://github.com/os-autoinst/openQA/pull/4232 is causing problems trying to update os-autoinst and deleting some dependencies. The full-stack.t log file then says "Can't locate Inline.pm in @INC" so it can not work.

I suspect a relation to
https://github.com/os-autoinst/os-autoinst/pull/1784
or
https://github.com/os-autoinst/os-autoinst/pull/1782

#11 Updated by cdywan 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

I'm taking this since it's affecting a lot of PRs. Let's see what I can come up with/

#12 Updated by cdywan 2 months ago

Not sure what's causing the Something Unexpected Happened error on CircleCI. At least one resut I can still see failed with Reason: isotovideo done: Can't use an undefined value as a symbol reference at /home/squamata/os-autoinst/myjsonrpc.pm line 54. so I'm adding DEBUG_JSON=1 to see if that makes the error more meaningful.

I'm also trying so far unsuccessfully to replicate the error locally (yes, I read all of the above comments).

#13 Updated by cdywan 2 months ago

I managed to reproduce in the meanwhile (using for ((i=1; i <= 20; i++)); do EXTRA_PROVE_ARGS=-v QEMU_NO_KVM=1 FULLSTACK=1 DEBUG_JSON=1 OPENQA_TEST_TIMEOUT_DISABLE=1 make test-with-database TESTS=t/full-stack.t || break; done) which makes me think the two container PR's are unrelated since the fullstack test isn't using containers afair.

#14 Updated by cdywan 2 months ago

Proposed an attempt at improving diagnostics to add context:

  • the job we're waiting on
  • subtests to allow failures to retain context
  • JSON errors from os-autoinst

See https://github.com/os-autoinst/openQA/pull/4243

#15 Updated by cdywan 2 months ago

Another failure in CI:

I'm trying to reproduce this one also by running tests on 15.3 but couldn't so far (note: this is distinct from the isotovideo issue above which I can sometimes reproduce on Tumbleweed), but I guess I'll try to change back the container anyway and see what I can learn from that.

Edit: I can reproduce it. I was stupidly using the wrong container w/o realizing it (this is why I hate errors disguised as warnings that get lost in scrollback).

Interestingly on Tumbleweed Inline-Python gets installed, on Leap 15.3 it doesn't with the exact same packages. Doesn't seem to affect the failure, though.

#16 Updated by cdywan 2 months ago

okurz wrote:

https://github.com/os-autoinst/openQA/pull/4233 clearly reproduces the problem in CI. The problem seems to be about recent software changes. Also the "dependency PR" in the past days as visible in https://github.com/os-autoinst/openQA/pull/4232 is causing problems trying to update os-autoinst and deleting some dependencies. The full-stack.t log file then says "Can't locate Inline.pm in @INC" so it can not work.

  • As indicated before I can't confirm that InlinePython is preventing the test from passing.
  • I proposed a PR to address the mismatching hash, although I'm not 100% sure why it happened.

#17 Updated by okurz 2 months ago

cdywan wrote:

Not sure what's causing the Something Unexpected Happened error on CircleCI

this was a problem on 2021-09-27, see https://status.circleci.com/ for the entry of that day: "[…] Right now, you might be unable to access the CircleCI UI. This is unfortunately a side-effect of our effort to fix the initial incident."

I'm also trying so far unsuccessfully to replicate the error locally (yes, I read all of the above comments).
[…]
I managed to reproduce in the meanwhile (using for ((i=1; i <= 20; i++)); do EXTRA_PROVE_ARGS=-v QEMU_NO_KVM=1 FULLSTACK=1 DEBUG_JSON=1 OPENQA_TEST_TIMEOUT_DISABLE=1 make test-with-database TESTS=t/full-stack.t || break; done)

just to verify, so you did manage to reproduce problems when running the above locally, right?

[…] which makes me think the two container PR's are unrelated since the fullstack test isn't using containers afair.

which "two container PR's"? And yes, if you just call make test TESTS=t/full-stack.t (or as you did above) the test will run in your local environment, no container.

#18 Updated by cdywan 2 months ago

okurz wrote:

cdywan wrote:

[…] which makes me think the two container PR's are unrelated since the fullstack test isn't using containers afair.

which "two container PR's"? And yes, if you just call make test TESTS=t/full-stack.t (or as you did above) the test will run in your local environment, no container.

I was referring to https://github.com/os-autoinst/os-autoinst/pull/1784 and https://github.com/os-autoinst/os-autoinst/pull/1782 respectively. I am able to reproduce all except for one error on Tumbleweed, with a certain probability after repeated executions.

I can reproduce tests died: unable to load main.pm, check the log for the cause reliably on Leap, although I still don't understand the cause because e.g. missing Python support doesn't affect the result.

#19 Updated by okurz 2 months ago

In https://app.circleci.com/pipelines/github/os-autoinst/openQA/7869/workflows/ea038df5-5d73-4559-8ac8-3ad19a68de73/jobs/74362?invite=true#step-104-963 for https://github.com/os-autoinst/openQA/pull/4251 I see that still an old version of os-autoinst-devel is installed

(263/264) Installing: os-autoinst-devel-4.6.1589880721.55a008dd-lp152.1.13.x86_64 [/](263/264) Installing: os-autoinst-devel-4.6.1589880721.55a008dd-lp152.1.13.x86_64 [done]

trying to reproduce this locally in the container registry.opensuse.org/devel/openqa/ci/containers/base:latest I see

2e14fdb3da3e:/opt/testing_area # zypper in --dry-run os-autoinst-devel
Loading repository data...
Reading installed packages...
Resolving package dependencies...

Problem: os-autoinst-devel-4.6.1632799442.f77d4e14-lp152.864.1.x86_64 requires qemu >= 4.0, but this requirement cannot be provided
  not installable providers: qemu-4.2.0-lp152.8.1.x86_64[repo-oss]
                   qemu-4.2.1-lp152.9.12.1.x86_64[repo-update]
                   qemu-4.2.1-lp152.9.16.2.x86_64[repo-update]
                   qemu-4.2.1-lp152.9.20.1.x86_64[repo-update]
                   qemu-4.2.1-lp152.9.3.1.x86_64[repo-update]
                   qemu-4.2.1-lp152.9.6.1.x86_64[repo-update]
                   qemu-4.2.1-lp152.9.9.2.x86_64[repo-update]
 Solution 1: Following actions will be done:
  install qemu-4.2.1-lp152.9.20.1.x86_64 (with vendor change)
    obs://build.opensuse.org/devel:openQA  -->  openSUSE
  install qemu-x86-4.2.1-lp152.9.20.1.x86_64 (with vendor change)
    obs://build.opensuse.org/devel:openQA  -->  openSUSE
  install qemu-kvm-4.2.1-lp152.9.20.1.x86_64 (with vendor change)
    obs://build.opensuse.org/devel:openQA  -->  openSUSE
 Solution 2: do not install os-autoinst-devel-4.6.1632799442.f77d4e14-lp152.864.1.x86_64
 Solution 3: break os-autoinst-devel-4.6.1632799442.f77d4e14-lp152.864.1.x86_64 by ignoring some of its dependencies

so there seems to be a problem about the qemu package dependency.

The container image we use for generating dependency update PRs is https://build.opensuse.org/package/show/devel:openQA:ci/base but https://build.opensuse.org/project/show/devel:openQA:ci included a link to qemu from openSUSE:Leap:15.1:Update (yes, 15.1!), which obviously we did as workaround for some problem at that time but I don't remember which one. Now Leap 15.2 has newer versions for qemu but in the base we still had the old version qemu-3.1.1.1 from 15.1 installed which can explain the conflict asking for a vendor change. I removed the linked package and "base" is currently rebuilding.

EDIT: "base" image is rebuilt. I added a description to https://build.opensuse.org/package/show/devel:openQA:ci/base to explain my changes as well. Trying out the refreshed container image locally works fine. I retriggered the nightly pipeline on CI: https://app.circleci.com/pipelines/github/os-autoinst/openQA/7869/workflows/4ece0f1a-d3fb-4692-b856-584cf4d6dc66/jobs/74398

the job did not create a new specific pull request. Reading the code I assume that is because actually no dependencies change. Could that be?

EDIT: 2021-09-25 1340Z: We will see tomorrow if a new automatic dependency PR would be created. In the meantime the actual full-stack.t problems can further be investigated.

#20 Updated by cdywan 2 months ago

Still locally re-running full-stack.t off my full_stack_t_results_panel_context branch since I don't think I've caught all possible issues but still can't force the failures. Meanwhile poking at #98898 since I re-considered if it's really blocked and I don't just want to spend time looking at passing tests

#21 Updated by cdywan about 2 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

Still locally re-running full-stack.t off my full_stack_t_results_panel_context branch since I don't think I've caught all possible issues but still can't force the failures. Meanwhile poking at #98898 since I re-considered if it's really blocked and I don't just want to spend time looking at passing tests

Results are looking stable for me locally, and even in CI. I updated the branch and added missing details to the description, and I'd like to get this merged in any case.

#22 Updated by cdywan about 2 months ago

I didn't get to take review comments into account because I was investigating #99195, hoping to do that today

#23 Updated by cdywan about 2 months ago

Looks like the myjsonrpc error is back. Unfortunately despite setting DEBUG_JSON in start_worker now I don't see the ($$) send_json($fd) JSON=$copy message I would expect 🤨️

    # 0
FAILED--Further testing stopped.
Retry 5 of 5 …
[16:35:16] t/full-stack.t .. 21/?     # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538136370,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538136559,
    #   },
    # ]
    # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538137700,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538137700,
    #   },
    # ]
[16:35:16] t/full-stack.t .. 25/?     # javascript console output, waiting for paused: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538142773,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538142773,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538142773,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/D6Ug4fkKaODmApGW/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1633538142773,
    #   },
    # ]
[16:35:16] t/full-stack.t .. 26/?     # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed'
    # full result panel contents:
    # 0 1
    # Result: failed finished less than a minute ago ( 00:10 minutes )
    # Reason: isotovideo done: Can't use an undefined value as a symbol reference at /home/squamata/os-autoinst/myjsonrpc.pm line 42.
    # Scheduled product: job has not been created by posting an ISO
    # Assigned worker: d7cabfd85cc4:1

    #   Failed test 'Expected result for job 1 not found'
    #   at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 99.

    #   Failed test 'job 1 passed'
    #   at t/full-stack.t line 192.
    # 0
Bailout called.  Further testing stopped:  0

#24 Updated by cdywan about 2 months ago

  • find fullstack.t autoinst.txt and preserve it somewhere persistently
  • attempt to reproduce "isotovideo done: Can't use an undefined value as a symbol" in os-autoinst
  • maybe the console closing caused an invalid file handle
  • shorten the fullstack test?

#25 Updated by cdywan about 2 months ago

I'm splitting out fixes now, according to specific issues:

https://github.com/os-autoinst/openQA/pull/4285

[13:04:01] t/full-stack.t .. 98/? Bailout called.  Further testing stopped:  clickElement: element not interactable at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:81
FAILED--Further testing stopped: clickElement: element not interactable at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:81
make[2]: *** [Makefile:195: test-unit-and-integration] Error 255
make[2]: Leaving directory '/home/squamata/project'
make[1]: *** [Makefile:190: test-with-database] Error 2
make[1]: Leaving directory '/home/squamata/project'
make: *** [Makefile:179: test-fullstack] Error 2

Exited with code exit status 2

https://github.com/os-autoinst/openQA/pull/4286

t/full-stack.t .. 36/? Bailout called.  Further testing stopped:  findElement: no such element: Unable to locate element: {"method":"css 
selector","selector":"#info_box .card-body"} at /home/kalikiana/openQA-dev/repos/openQA/t/lib/OpenQA/SeleniumTest.pm:81

#26 Updated by okurz about 2 months ago

  • Due date changed from 2021-10-08 to 2021-10-22
  • Priority changed from Urgent to High

Seems like in master there are next to no full-stack failures hitting non-related PRs so reducing prio. For the still open PRs I am confident we can get them merged within the next two weeks

#27 Updated by cdywan about 2 months ago

Split off another PR based on a commit adding an uncaught console error, but instead proposing to make console errors consistently fail tests: https://github.com/os-autoinst/openQA/pull/4288

#28 Updated by cdywan about 2 months ago

cdywan wrote:

[13:04:01] t/full-stack.t .. 98/? Bailout called.  Further testing stopped:  clickElement: element not interactable at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:81
FAILED--Further testing stopped: clickElement: element not interactable at /home/squamata/project/t/lib/OpenQA/SeleniumTest.pm:81

https://github.com/os-autoinst/openQA/pull/4286

t/full-stack.t .. 36/? Bailout called.  Further testing stopped:  findElement: no such element: Unable to locate element: {"method":"css 
selector","selector":"#info_box .card-body"} at /home/kalikiana/openQA-dev/repos/openQA/t/lib/OpenQA/SeleniumTest.pm:81

Alternative, global approach to element find timeouts https://github.com/os-autoinst/openQA/pull/4293

#29 Updated by okurz about 2 months ago

both last mentioned PRs still open. https://app.circleci.com/pipelines/github/os-autoinst/openQA/8095/workflows/51b61f49-36f3-43a9-961a-715ee2af25c3/jobs/76539/parallel-runs/0/steps/0-111 in one of my pull requests looks lkie related:

[17:00:39] t/full-stack.t .. 23/?     # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/wUn0NgQsFzk9IVSB/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1634230882315,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/wUn0NgQsFzk9IVSB/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1634230882682,
    #   },
    # ]
    # javascript console output, waiting for connection opened: [
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/wUn0NgQsFzk9IVSB/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1634230883918,
    #   },
    #   {
    #     level     => "SEVERE",
    #     message   => "http://localhost:9526/asset/88ed2c606e/ws_console.js 8 WebSocket connection to 'ws://127.0.0.1:20013/wUn0NgQsFzk9IVSB/ws' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED",
    #     source    => "network",
    #     timestamp => 1634230883918,
    #   },
    # ]

#30 Updated by okurz about 2 months ago

I found again annoying full-stack problems, e.g. in
https://app.circleci.com/pipelines/github/os-autoinst/openQA/8119/workflows/52e0fa2d-164f-475b-ab07-68129f483bca/jobs/76744/steps
I manually retriggered and tests worked so I created https://github.com/os-autoinst/openQA/pull/4314 to add even more retries.

I also created
https://github.com/os-autoinst/os-autoinst/pull/1827
to introduce signatures in os-autoinst myjsonrpc. This might help us with the ugly error

# Reason: isotovideo done: Can't use an undefined value as a symbol reference at /home/squamata/os-autoinst/myjsonrpc.pm line 42.

But likely it won't help because there will be an argument supplied only that the socket behind that is actually undef.

#31 Updated by cdywan about 1 month ago

https://github.com/os-autoinst/openQA/pull/4286

t/full-stack.t .. 36/? Bailout called.  Further testing stopped:  findElement: no such element: Unable to locate element: {"method":"css 
selector","selector":"#info_box .card-body"} at /home/kalikiana/openQA-dev/repos/openQA/t/lib/OpenQA/SeleniumTest.pm:81

Still scratching my head on how this somehow reduces coverage and trying things out. I don't even understand how this can affect coverage (see comments on the PR).

Alternative, global approach to element find timeouts https://github.com/os-autoinst/openQA/pull/4293

Changed this to drop implicit timeouts instead. Still confirming necessary fixes for tests that were relying on the 2s delay.

cdywan wrote:

Proposed an attempt at improving diagnostics to add context:

  • the job we're waiting on
  • subtests to allow failures to retain context
  • JSON errors from os-autoinst

See https://github.com/os-autoinst/openQA/pull/4243

Merged. Most importantly this will reveal logs of failed "jobs" in full-stack.t in CI (and locally w/o special options).

#32 Updated by cdywan about 1 month ago

  • Due date changed from 2021-10-22 to 2021-11-12
  • Priority changed from High to Normal

As discussed briefly in the weekly, I'm reducing the priority since PRs don't fail due to retries. I plan to reduce retries again once my outstanding PRs addressing racy element interactions are merged.

#33 Updated by cdywan about 1 month ago

https://github.com/os-autoinst/os-autoinst/pull/1828 got merged meaning jobs should fail with myjsonprc: called on undefined file descriptor rather than isotovideo done: Can't use an undefined value as a symbol reference at. in combination with https://github.com/os-autoinst/openQA/pull/4243 I'm hoping to be able to see why

#34 Updated by okurz about 1 month ago

  • Related to action #101734: Prevent "called on undefined file descriptor" in myjsonrpc.pm:40 in openQA t/full-stack.t added

#35 Updated by cdywan 24 days ago

Based on an idea in conversation I prepared a proof of concept so we can more easily test stability w/o having to modify CI config files in separate branches that we won't want to merge and simply add e.g. #fullstack-stability to unmodified PRs.

#36 Updated by cdywan 23 days ago

  • Status changed from Feedback to In Progress

Should set this to In Progress.

#37 Updated by cdywan 20 days ago

  • Due date changed from 2021-11-12 to 2021-11-19

cdywan wrote:

Based on an idea in conversation I prepared a proof of concept so we can more easily test stability w/o having to modify CI config files in separate branches that we won't want to merge and simply add e.g. #fullstack-stability to unmodified PRs.

Ended up preparing a reference PR anyway since the GHA isn't fully working. Was confused by errors also mentioned in #102332. But this is also because we want to be conservative here and I wanted to demonstrate that we do improve, and not somehow add new instabilities.

#38 Updated by okurz 17 days ago

  • Priority changed from Normal to High

t/full-stack.t is becoming more and more of a problem, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/8392/workflows/6748c60b-4966-4db6-806b-415da997d321/jobs/79281 from https://github.com/os-autoinst/openQA/pull/4360 . I suggest more people to collaborate on this ticket.

#39 Updated by cdywan 16 days ago

  • Due date deleted (2021-11-19)
  • Status changed from In Progress to New
  • Assignee deleted (cdywan)

okurz wrote:

t/full-stack.t is becoming more and more of a problem, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/8392/workflows/6748c60b-4966-4db6-806b-415da997d321/jobs/79281 from https://github.com/os-autoinst/openQA/pull/4360 . I suggest more people to collaborate on this ticket.

    #   Failed test 'Expected result for job 1 not found'

And that's yet another failure. I'm setting this back to New now so it can be re-assessed. And either we agree on what exact error(s) will be fixed here, or it needs to be an epic.

#40 Updated by cdywan 16 days ago

cdywan wrote:

okurz wrote:

t/full-stack.t is becoming more and more of a problem, e.g. see https://app.circleci.com/pipelines/github/os-autoinst/openQA/8392/workflows/6748c60b-4966-4db6-806b-415da997d321/jobs/79281 from https://github.com/os-autoinst/openQA/pull/4360 . I suggest more people to collaborate on this ticket.

    #   Failed test 'Expected result for job 1 not found'

And that's yet another failure. I'm setting this back to New now so it can be re-assessed. And either we agree on what exact error(s) will be fixed here, or it needs to be an epic.

#41 Updated by cdywan 16 days ago

  • Tracker changed from action to coordination
  • Subject changed from t/full-stack.t can timeout, "clickElement: element not interactable" size:M to [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors
  • Status changed from New to Blocked
  • Assignee set to cdywan

Making this an epic now

#42 Updated by cdywan 16 days ago

  • Copied to action #102578: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M added

#43 Updated by cdywan 16 days ago

Also available in: Atom PDF