Project

General

Profile

Actions

action #114881

closed

[sporadic] OBS checks fail os-autoinst test "exceeds runtime limit of '200' seconds " from t/27-consoles-vmware.t on ppc64le size:M

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-07-01
Due date:
% Done:

0%

Estimated time:

Description

Observation

From https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/15.4/ppc64le

3: Bailout called.  Further testing stopped:  test './t/27-consoles-vmware.t' exceeds runtime limit of '200' seconds
[  698s] 3: FAILED--Further testing stopped: test './t/27-consoles-vmware.t' exceeds runtime limit of '200' seconds
[  698s] 3/3 Test #3: test-perl-testsuite ..............***Failed  435.10 sec

Suggestions

  • Exclude the test from OBS checks on ppc64le
  • Increase the timeout
  • Understand why iff there's a bug in the (test) code

Rollback steps

  • Enable test on OBS checks again

Files

obs-vmware-test-failure.log.txt (334 KB) obs-vmware-test-failure.log.txt Test failure on Leap_15.4/x86_64 mkittler, 2022-09-05 14:34
obs-vmware-test-failure-2.log.txt (345 KB) obs-vmware-test-failure-2.log.txt mkittler, 2022-09-07 13:58
Actions #1

Updated by okurz about 2 years ago

  • Target version set to Ready
Actions #2

Updated by livdywan about 2 years ago

Note that I re-triggered the build and it passed this time so maybe we don't need to exclude it for now, but it's still racy of course I meant to. I think after my toggling work-around for #108530#note-23 it's actually being rebuilt.

I prepared a branch to exclude the test because it looks to me like increasing the timeout won't help, considering it fails before there's any meaningful output: https://github.com/os-autoinst/os-autoinst/pull/2139

Actions #3

Updated by livdywan about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

I should probably take the ticket, although for now I'm not planning to do an in-depth investigation beyond trying to reproduce on x86-76.

Actions #4

Updated by openqa_review about 2 years ago

  • Due date set to 2022-08-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan about 2 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

I prepared a branch to exclude the test because it looks to me like increasing the timeout won't help, considering it fails before there's any meaningful output: https://github.com/os-autoinst/os-autoinst/pull/2139

The branch was merged. Maybe we want to leave it at that?

After discussing it in the daily I'm going to try and propose a follow-up branch to exclude the test only on ppc64le e.g. using something like %ifnarch ppc64le.

Actions #6

Updated by livdywan about 2 years ago

  • Subject changed from [sporadic] OBS checks fail os-autoinst test "exceeds runtime limit of '200' seconds " from t/27-consoles-vmware.t on ppc64le to [sporadic] OBS checks fail os-autoinst test "exceeds runtime limit of '200' seconds " from t/27-consoles-vmware.t on ppc64le size:S
Actions #7

Updated by livdywan about 2 years ago

cdywan wrote:

After discussing it in the daily I'm going to try and propose a follow-up branch to exclude the test only on ppc64le e.g. using something like %ifnarch ppc64le.

https://github.com/os-autoinst/os-autoinst/pull/2142

I found various examples of %ifarch ppc64le in the wild so that's what I'm using now.

Actions #8

Updated by livdywan about 2 years ago

  • Status changed from Feedback to Resolved

cdywan wrote:

https://github.com/os-autoinst/os-autoinst/pull/2142

I found various examples of %ifarch ppc64le in the wild so that's what I'm using now.

Merged

Actions #9

Updated by tinita about 2 years ago

Now I saw it on x86_64: https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Tumbleweed/x86_64

[  572s] 3: Bailout called.  Further testing stopped:  test './t/27-consoles-vmware.t' exceeds runtime limit of '200' seconds
[  572s] 3: [17:59:59] ./t/27-consoles-vmware.t ................... 
[  572s] 3: # Subtest: test configuration with fake URL
[  572s] 3:     ok 1 - noop if URL not set
[  572s] 3:     ok 2 - log message present without secrets
[  572s] 3:     ok 3 - VMWare "console" returned if URL is set
[  572s] 3:     ok 4 - hostname saved as original hostname
[  572s] 3:     ok 5 - original hostname set to hostname
[  572s] 3:     ok 6 - hostname assigned
[  572s] 3:     ok 7 - hostname set to localhost
[  572s] 3:     ok 8 - description assigned
[  572s] 3:     ok 9 - description set accordingly
[  572s] 3:     ok 10 - dewebsockify called with expected args
[  572s] 3:     ok 11 - hostname set
[  572s] 3:     ok 12 - no VM-ID set (as our URL did not include one)
[  572s] 3:     ok 13 - username set
[  572s] 3:     ok 14 - password set (with URL-encoded character)
[  572s] 3:     ok 15 - protocol configured from URL
[  572s] 3:     ok 16 - host configured from URL
[  572s] 3:     ok 17 - specific VM-ID configured from URL
[  572s] 3:     1..17
[  572s] 3: ok 1 - test configuration with fake URL
[  572s] 3: # Subtest: request WebSockets URL
[  572s] 3:     ok 1 - auth error handled
[  572s] 3:     ok 2 - ws request error handled
[  572s] 3:     ok 3 - no ws URL handled
[  572s] 3:     ok 4 - no cookie handled
[  572s] 3:     ok 5 - URL found
[  572s] 3:     ok 6 - cookie returned
[  572s] 3:     1..6
[  572s] 3: ok 2 - request WebSockets URL
[  572s] 3: # Subtest: deducing VNC over WebSockets URL from vars
[  572s] 3:     ok 1 - no URL if VMWARE_VNC_OVER_WS not set
[  572s] 3:     ok 2 - no URL if VIRSH_GUEST not matching
[  572s] 3:     ok 3 - error if vars specified inconsistently
[  572s] 3:     ok 4 - error if password missing
[  572s] 3:     ok 5 - URL deduced from vars
[  572s] 3:     ok 6 - original hostname used to check if VIRSH_GUEST matching
[  572s] 3:     1..6
[  572s] 3: ok 3 - deducing VNC over WebSockets URL from vars
[  572s] 3: # Subtest: turning WebSocket into normal socket via dewebsockify
[  572s] 3:     ok 1 - could start test WebSocket server
[  572s] 3:     ok 2 - dewebsockify PID tracked
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1999 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1998 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1997 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1996 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1995 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1994 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1993 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1992 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1991 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1990 times)
[  572s] 3:     ok 3 - expected data received via raw socket
[  572s] 3:     ok 4 - expected data received via WebSocket
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1989 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1988 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1987 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1986 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1985 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1984 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1983 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1982 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1981 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1980 times)
[  572s] 3:     ok 5 - error logged
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1979 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1978 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1977 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1976 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1975 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1974 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1973 times)
[  572s] 3:     # unable to connect to dewebsockify on port 49607: Connection refused (will try again 1972 times)
[  572s] 3: Dubious, test returned 255 (wstat 65280, 0xff00)
[  572s] 3: All 4 subtests passed 
[  572s] 3: [17:59:59]
[  572s] 3: 
[  572s] 3: Test Summary Report
[  572s] 3: -------------------
[  572s] 3: ./t/27-consoles-vmware.t                 (Wstat: 65280 (exited 255) Tests: 4 Failed: 0)
[  572s] 3:   Non-zero exit status: 255
[  572s] 3:   Parse errors: No plan found in TAP output

At least it seems to execute some of the tests...

Actions #10

Updated by livdywan about 2 years ago

  • Due date deleted (2022-08-17)
  • Status changed from Resolved to Workable
  • Assignee deleted (livdywan)
Actions #11

Updated by livdywan about 2 years ago

  • Subject changed from [sporadic] OBS checks fail os-autoinst test "exceeds runtime limit of '200' seconds " from t/27-consoles-vmware.t on ppc64le size:S to [sporadic] OBS checks fail os-autoinst test "exceeds runtime limit of '200' seconds " from t/27-consoles-vmware.t on ppc64le size:M
Actions #12

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #13

Updated by mkittler about 2 years ago

  • Status changed from Workable to In Progress

This PR may help: https://github.com/os-autoinst/os-autoinst/pull/2153 (see its commit message for details)

Actions #14

Updated by openqa_review about 2 years ago

  • Due date set to 2022-09-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

The PR has been merged. So far I haven't seen the issue again. Let's see whether tests on https://github.com/os-autoinst/os-autoinst/pull/2156 pass (which will enable them on PowerPC again).

Actions #16

Updated by mkittler about 2 years ago

Unfortunately it happened again but I haven't saved the entire log. So I'm waiting for another occurrence and will preserve the OBS build log.

Actions #18

Updated by mkittler about 2 years ago

Unfortunately these logs don't contain much to go on. For now I'll just disable the test completely in OBS: https://github.com/os-autoinst/os-autoinst/pull/2162

Actions #19

Updated by mkittler about 2 years ago

Now I've got an occurrence that has a little bit more logging output.

Actions #20

Updated by mkittler about 2 years ago

Considering also the last logs I'm wondering whether this test is really only taking so long on OBS. It doesn't reach the time limit in a moment where it looks stuck (like a loop of always failing connection attempts, in fact it ends with a successful connection being logged and the previous subtest which are very similar have passed before). However, we already scale the timeout to be 200 seconds. Even with coverage enabled it takes locally < 5 seconds. So it seems very unlikely that it really takes that long. Unfortunately we cannot see how much time passed between the log lines because CMake just replies the output as one big junk.

Actions #21

Updated by mkittler about 2 years ago

When this test passes on OBS it is actually very fast as well:

…
[  158s] ok 5 - multiple attempts to launch VNC server
[  158s] # Subtest: test against real VMWare instance
[  158s]     1..0 # SKIP Set OS_AUTOINST_TEST_AGAINST_REAL_VMWARE_INSTANCE to run this test.
[  158s] ok 6 # skip Set OS_AUTOINST_TEST_AGAINST_REAL_VMWARE_INSTANCE to run this test.
[  158s] ok 7 - no (unexpected) warnings (via done_testing)
[  158s] 1..7
[  158s] ok     1619 ms ( 0.01 usr  0.00 sys +  1.39 cusr  0.15 csys =  1.55 CPU)
[  160s] [10:02:07] ./t/27-consoles-vnc.t ...................... 
…

So something must hang after the connection has been established and before waitpid is executed.

Actions #22

Updated by mkittler about 2 years ago

This PR which has unbuffered output enabled shows where it hangs: https://github.com/os-autoinst/os-autoinst/pull/2167

[  159s] 3: [14:02:16] ./t/27-consoles-vmware.t ................... 
[  159s] 3: # Subtest: test configuration with fake URL
[  159s] 3:     ok 1 - noop if URL not set
[  159s] 3:     ok 2 - log message present without secrets
[  159s] 3:     ok 3 - VMWare "console" returned if URL is set
[  159s] 3:     ok 4 - hostname saved as original hostname
[  159s] 3:     ok 5 - original hostname set to hostname
[  159s] 3:     ok 6 - hostname assigned
[  159s] 3:     ok 7 - hostname set to localhost
[  159s] 3:     ok 8 - description assigned
[  159s] 3:     ok 9 - description set accordingly
[  159s] 3:     ok 10 - dewebsockify called with expected args
[  159s] 3:     ok 11 - hostname set
[  159s] 3:     ok 12 - no VM-ID set (as our URL did not include one)
[  159s] 3:     ok 13 - username set
[  159s] 3:     ok 14 - password set (with URL-encoded character)
[  159s] 3:     ok 15 - protocol configured from URL
[  159s] 3:     ok 16 - host configured from URL
[  159s] 3:     ok 17 - specific VM-ID configured from URL
[  159s] 3:     1..17
[  159s] 3: ok 1 - test configuration with fake URL
[  159s] 3: # Subtest: request WebSockets URL
[  159s] 3:     ok 1 - auth error handled
[  159s] 3:     ok 2 - ws request error handled
[  159s] 3:     ok 3 - no ws URL handled
[  159s] 3:     ok 4 - no cookie handled
[  159s] 3:     ok 5 - URL found
[  159s] 3:     ok 6 - cookie returned
[  159s] 3:     1..6
[  159s] 3: ok 2 - request WebSockets URL
[  159s] 3: # Subtest: deducing VNC over WebSockets URL from vars
[  159s] 3:     ok 1 - no URL if VMWARE_VNC_OVER_WS not set
[  159s] 3:     ok 2 - no URL if VIRSH_GUEST not matching
[  159s] 3:     ok 3 - error if vars specified inconsistently
[  159s] 3:     ok 4 - error if password missing
[  159s] 3:     ok 5 - URL deduced from vars
[  159s] 3:     ok 6 - original hostname used to check if VIRSH_GUEST matching
[  159s] 3:     1..6
[  159s] 3: ok 3 - deducing VNC over WebSockets URL from vars
[  159s] 3: # Subtest: turning WebSocket into normal socket via dewebsockify
[  159s] 3:     ok 1 - could start test WebSocket server
[  159s] 3:     ok 2 - dewebsockify PID tracked: 3910
[  159s] 3:     # connecting to dewebsockify on port 46259
[  159s] 3:     # unable to connect to dewebsockify on port 46259: Connection refused (will try again 499 times)
[  159s] 3:     # connecting to dewebsockify on port 46259
[  159s] 3:     # unable to connect to dewebsockify on port 46259: Connection refused (will try again 498 times)
[  159s] 3:     # connecting to dewebsockify on port 46259
[  159s] 3:     # unable to connect to dewebsockify on port 46259: Connection refused (will try again 497 times)
[  159s] 3:     # connecting to dewebsockify on port 46259
[  159s] 3:     # unable to connect to dewebsockify on port 46259: Connection refused (will try again 496 times)
[  159s] 3:     # connecting to dewebsockify on port 46259
[  159s] 3:     # connection to dewebsockify established via port 46259
[  159s] 3:     ok 3 - expected data received via raw socket
[  159s] 3:     ok 4 - expected data received via WebSocket
[  159s] 3:     # waiting for dewebsockify process to terminate, pid: 3910
[  159s] 3:     # Subtest: handle error when WebSocket server is not reachable
[  159s] 3:         # connecting to dewebsockify on port 46259
[  159s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 499 times)
[  159s] 3:         # connecting to dewebsockify on port 46259
[  159s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 498 times)
[  159s] 3:         # connecting to dewebsockify on port 46259
[  159s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 497 times)
[  159s] 3:         # connecting to dewebsockify on port 46259
[  159s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 496 times)
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # connection to dewebsockify established via port 46259
[  160s] 3:         # closing connection to dewebsockify immediately
[  160s] 3:         # waiting for dewebsockify process to terminate, pid: 3911
[  160s] 3:         ok 1 - error logged
[  160s] 3:         1..1
[  160s] 3:     ok 5 - handle error when WebSocket server is not reachable
[  160s] 3:     # Subtest: handle error when HTTP server is not upgrading to WebSockets
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 499 times)
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 498 times)
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 497 times)
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # unable to connect to dewebsockify on port 46259: Connection refused (will try again 496 times)
[  160s] 3:         # connecting to dewebsockify on port 46259
[  160s] 3:         # connection to dewebsockify established via port 46259
[  358s] 3: Bailout called.  Further testing stopped:  test './t/27-consoles-vmware.t' exceeds runtime limit of '200' seconds
[  358s] 3: Bail out!  test './t/27-consoles-vmware.t' exceeds runtime limit of '200' seconds
[  358s] 3: Dubious, test returned 255 (wstat 65280, 0xff00)
[  358s] 3: All 4 subtests passed

So the test execution was very fast before and it really just hangs on the last line.

The test is expecting dewebsockify to terminate and close the connection here. Then it'll stop the event loop but that apparently never happens. So there must be something wrong with that, e.g. the callback registered via $self->tx->on(finish => sub { Mojo::IOLoop->stop }); is never invoked. (Commenting out that line will obviously reproduce the issue; the question is why it is not invoked in some cases.)

Actions #23

Updated by livdywan about 2 years ago

  • Due date changed from 2022-09-10 to 2022-09-16

Discussion on-going, hence bumping the due date. I'm also trying to get a better idea of what https://github.com/os-autoinst/os-autoinst/blob/master/t/27-consoles-vmware.t#L191 does and maybe provide some fresh input

Actions #24

Updated by mkittler about 2 years ago

Note that https://github.com/os-autoinst/os-autoinst/pull/2167 contains a commit to simplify the VMWare test a little bit. So the line $t->ua->ioloop->start until $client_callback; is no longer a loop. I'm not sure yet whether that helped. I was under the impression it did not because TW checks are reported as failing. However, if I now look on the logs on OBS itself everything looks good. I'll retrigger the build a few times to see whether that's actually stable.

Actions #25

Updated by kraih about 2 years ago

Had a quick look at the test code, and two things did jump out at me, 1) there is a lot of open/close/read... calls without exception handling, and i did not see autodie being used (so they could fail quietly), 2) there is mixed use of $t->ua->ioloop->... and Mojo::IOLoop->..., which can easily result in deadlocks if more than one event loop gets initialized accidentally, best to only use one way to reference the event loop instance.

Since the test only fails in a specific environment, it might be interesting to know if the event loop backend used is Mojo::Reactor::Poll or Mojo::Reactor::EV. There are slight differences in behavior that could make replicating a problem very hard.

Actions #26

Updated by kraih about 2 years ago

cdywan wrote:

I'm also trying to get a better idea of what https://github.com/os-autoinst/os-autoinst/blob/master/t/27-consoles-vmware.t#L191 does and maybe provide some fresh input

That's the reconnect for the next attempt, delayed by 0.1 seconds.

Actions #27

Updated by kraih about 2 years ago

mkittler wrote:

Note that https://github.com/os-autoinst/os-autoinst/pull/2167 contains a commit to simplify the VMWare test a little bit. So the line $t->ua->ioloop->start until $client_callback; is no longer a loop.

I just don't see how that would make a difference. In the unbuffered output we are seeing multiple unable to connect to dewebsockify on port... messages, and that would mean $client_callback is true anyway since that callback has been called multiple times.

Actions #28

Updated by okurz about 2 years ago

  • Due date deleted (2022-09-16)
  • Status changed from Feedback to Resolved

discussed in weekly unblock of SUSE QE Tools 2022-09-14. We don't know what the specific problem is within the OBS environment. We don't see it as feasible or useful to investigate further. We agreed to keep the test disabled with a comment pointing to this ticket in the spec file and just keeping the test within the CI environment.

Actions

Also available in: Atom PDF