action #49961
closedPrevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
0%
Description
Acceptance criteria¶
- AC: Prevent svirt backend to hang on
virsh undefine
command, e.g. reduce the timeout for this command.
Observation¶
[2019-04-02T18:12:36.938 CEST] [debug] Command executed: virsh destroy openQA-SUT-1, ret=0
[2019-04-02T18:12:37.090 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-04-02T18:12:37.182 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 has been undefined
[2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh undefine --snapshots-metadata openQA-SUT-1, ret=0 <<<<<<<< apparently it hangs after attempting to execute this command
[2019-04-02T19:39:01.274 CEST] [debug] backend got TERM
[2019-04-02T19:39:01.275 CEST] [debug] Destroying openQA-SUT-1 virtual machine
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":60535"
after 15744 requests (15744 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":57603"
after 15740 requests (15583 known processed) with 0 events remaining.
[2019-04-02T19:39:01.290 CEST] [debug] signalhandler got TERM
[2019-04-02T19:39:01.290 CEST] [debug] killing backend process 657476
Reproducible¶
Fails since Build 204.1
in scenario sle-15-SP1-Installer-DVD-s390x-gnome+proxy_SCC+allmodules@s390x-kvm-sle12
Latest builds are working again.
Expected result¶
Last good: 199.1
Further details¶
Always latest result in this scenario: latest
Updated by mgriessmeier over 5 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Subject changed from [functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout
- Description updated (diff)
- Category changed from Bugs in existing tests to Regressions/Crashes
- Status changed from New to Workable
- Target version set to Milestone 24
Updated by szarate over 5 years ago
I would suggest that we can create a needle that would try to match the current screen: https://openqa.suse.de/tests/2843000#step/hostname_inst/5, if it matches (except for the ip address), mark it as a failed test, with a reference to either a progress ticket or a bsc. That should be good enogh for the first run
Updated by zluo over 5 years ago
- Status changed from Workable to In Progress
- Assignee set to zluo
take over and check.
Updated by mkittler over 5 years ago
Note that the message
[2019-04-02T18:12:34.094 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.
does not mean the test timed out after reaching the configured timeout of usually 2 hours.
It just means that isotovideo could not tell the websocket clients that the test execution ended. This is required for the developer mode to suppress irritating error messages. This debug message has nothing to do with the test execution itself. It is printed after the test execution has already ended.
I'm not sure why it fails, though. But it looks like it actually just took more than 15 seconds to distribute that message to the openqa-livehandler daemon.
Updated by mkittler over 5 years ago
Looking at the job history there are indeed timed-out jobs. But still, the error message stated under 'Observation' has nothing to do with these timeouts.
It seems that the test execution time has been reduced in more recent testruns (e.g. https://openqa.suse.de/tests/2862771).
Updated by zluo over 5 years ago
compare
normal test run takes less than 60 minutes:
https://openqa.suse.de/tests/2862771/file/video.ogv at 3:29
with incomplete test run takes about 3 hours:
https://openqa.suse.de/tests/2763448/file/video.ogv at end 5:00
running test code from apachetest.pm
I found the performance from incomplete test is poor, but I don't think it takes really exact 3:00 hours.
If SUT is not done with test, then It should stop kvm guest for running further test modules, but this is failed which mentioned in
isotovideo: unable to inform websocket clients about stopping command server
Updated by zluo over 5 years ago
so what I can say: all tests till consoletest_finish runs completely and successfully.
We have for this case only a problem for: isotovideo send command to end running tests on SUT after 3:00 hours, but SUT is overloaded somehow and it cannot stop working and it continues running till end of tests.
Updated by zluo over 5 years ago
need to check:
https://openqa.suse.de/tests/2867413#next_previous
Updated by mkittler over 5 years ago
If SUT is not done with test, then It should stop kvm guest for running further test modules, but this is failed which mentioned in
The message 'isotovideo: unable to inform websocket clients about stopping command server' and the problem that the SUT continues to run are not related.
Updated by zluo over 5 years ago
over 100 test runs on osd don't show this issue.
2 failures: ssc_registration (executed by grenache-1:13) and reconnect_mgmt_console (executed by grenache-1:14) which are not related with this issue.
Updated by zluo over 5 years ago
compare successful test runs:
https://openqa.suse.de/tests/2867385/file/autoinst-log.txt
[2019-05-07T10:59:08.462 CEST] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/consoletest_finish.pm:40 called testapi::wait_still_screen
[2019-05-07T10:59:08.462 CEST] [debug] <<< testapi::wait_still_screen(timeout=30, stilltime=2, similarity_level=47)
[2019-05-07T10:59:10.502 CEST] [debug] >>> testapi::wait_still_screen: detected same image for 2 seconds, last detected similarity is 1000000
[2019-05-07T10:59:10.729 CEST] [debug] ||| finished consoletest_finish console at 2019-05-07 08:59:10 (13 s)
[2019-05-07T10:59:10.730 CEST] [debug] killing autotest process 333213
[2019-05-07T10:59:10.733 CEST] [debug] [autotest] process exited: 0
[2019-05-07T10:59:11.734 CEST] [debug] done with autotest process
[2019-05-07T10:59:11.734 CEST] [debug] killing command server 333212 because test execution ended
[2019-05-07T10:59:11.734 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/heeley63kbisPxst/broadcast
[2019-05-07T10:59:12.744 CEST] [debug] commands process exited: 0
[2019-05-07T10:59:13.745 CEST] [debug] done with command server
[2019-05-07T10:59:13.745 CEST] [debug] isotovideo done
[2019-05-07T10:59:13.843 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:13.843 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x100205e8c00)='! virsh dominfo openQA-SUT-1 | grep -w \'shut off\'', hostname='s390p8.suse.de', password='.zKVM.')
[2019-05-07T10:59:13.931 CEST] [debug] Command executed: '! virsh dominfo openQA-SUT-1 | grep -w 'shut off'', ret=0
[2019-05-07T10:59:13.943 CEST] [debug] BACKEND SHUTDOWN 0
[2019-05-07T10:59:13.944 CEST] [debug] Destroying openQA-SUT-1 virtual machine
[2019-05-07T10:59:14.030 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:14.030 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x1003dc3d870)='virsh destroy openQA-SUT-1', password='.zKVM.', hostname='s390p8.suse.de')
[2019-05-07T10:59:14.562 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 destroyed
with
https://openqa.suse.de/tests/2763448/file/worker-log.txt
[2019-04-02T19:39:00.0200 CEST] [warn] [pid:542426] max job time exceeded, aborting 02763448-sle-15-SP1-Installer-DVD-s390x-Build204.1-gnome+proxy_SCC+allmodules@s390x-kvm-sle12
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] stop_job timeout
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer update_status
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer job_timeout
[2019-04-02T19:39:00.0201 CEST] [debug] [pid:542426] POST http://openqa.suse.de/api/v1/jobs/2763448/status
[2019-04-02T19:39:00.0272 CEST] [info] [pid:542426] trying to stop job gracefully by announcing it to command server via http://localhost:20123/FLA9DxoQcuJX3vgy/broadcast
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] unable to stop the command server gracefully:
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] command server likely not reachable at all
[2019-04-02T19:39:09.0054 CEST] [info] [pid:542426] uploading logs_from_installation_system-y2logs.tar.bz2
we can see only difference here is something wrong with command server which is not reachable.
Updated by zluo over 5 years ago
https://openqa.suse.de/tests/2763448/file/autoinst-log.txt
started with time [2019-04-02T16:38:49.0684 CEST] till [2019-04-02T18:12:19.078 CEST] it took about 1 hour 34 minutes to get all tests finished.
in worker.log we see
[2019-04-02T19:39:00.0200 CEST] max job time exceeded
Updated by mkittler over 5 years ago
- Priority changed from High to Normal
After a closer look it seems like the following happens:
- All test modules are executed.
isotovideo
tries to inform web socket clients that the command server that it is going to terminate. That this does not work which might indicate that the command server is too busy at that time. However, it does not cause everything to hang. Instead ...- ... isotovideo attempts to shutdown the backend. This is where the timeout is exceeded. The last line logged by the backend is
[2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh undefine --snapshots-metadata openQA-SUT-1, ret=0
. So apparently that command or some command after it (which is not logged) hangs. - The worker kills everything after the timeout happened.
The command mentioned under 3. is ran via SSH by the svirt backend. So either virsh undefine
itself hang or the SSH connection (not sure what the timeout for SSH is).
Since the latest jobs doesn't seem to run into the problem anymore I set the priority to normal.
@zulo So this is an issue within the svirt backend. You can unassign yourself if you like somebody else to take over.
Updated by mkittler over 5 years ago
- Related to action #50225: Make JOB_TIMEOUT incompletes more obvious added
Updated by zluo over 5 years ago
- Status changed from In Progress to Rejected
unassign myself at first.
Updated by zluo over 5 years ago
- Status changed from Rejected to Workable
- Assignee deleted (
zluo)
reject this ticket since #50225 will handle this issue.
Updated by mkittler over 5 years ago
- Subject changed from [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
- Description updated (diff)
I adjusted the ticket description and AC because the os-autoinst log actually allows to figure out what's taking so long. Just look at the timestamps beside the log messages. The real problem is that the svirt backend hangs. The log messages stated under observation were also irrelevant.
If the tools team should take over you can change the target version.
Updated by mkittler over 5 years ago
The virsh command itself does not hang because Command executed
is really printed after the command is executed.
This leaves only the SSH connection. After Command executed
the SSH channel is attempted to be closed. Then it might download a file via SCP if open_serial_console_via_ssh
has been called previously. But that's actually not the case in the failed job. Not sure whether closing the SSH channel would cause this timeout. But we actullay don't set any timeout for the SSH connection so the default of having no timeout applies (https://metacpan.org/pod/Net::SSH2#timeout-(-timeout_ms-)).
Than it is 'flushing frames'. Maybe it hangs there. It could be the case because that involves blocking IPC with the video encoder. But then it likely wouldn't be an svirt backend-specific issue.
I guess we have to add more debug printing to find out where it actually hangs and wait until it happens again.
Updated by SLindoMansilla over 5 years ago
- Status changed from Workable to New
- Priority changed from Normal to High
For next grooming
Updated by SLindoMansilla over 5 years ago
- Related to action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Updated by SLindoMansilla over 5 years ago
- Related to deleted (action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out")
Updated by SLindoMansilla over 5 years ago
- Blocked by action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Updated by mkittler over 5 years ago
PR for debugging whether it gets stuck before or after 'flushing frames': https://github.com/os-autoinst/os-autoinst/pull/1148
I'm not that sure why this is blocked by "Connection timed out".
Updated by mgriessmeier over 5 years ago
- Target version changed from Milestone 24 to Milestone 25
move to M25
Updated by SLindoMansilla over 5 years ago
- Assignee set to mgriessmeier
As discussed in the grooming meeting
Updated by mgriessmeier over 5 years ago
- Target version changed from Milestone 25 to Milestone 26
Updated by mgriessmeier over 5 years ago
- Target version changed from Milestone 26 to Milestone 28
Updated by coolo about 5 years ago
Perhaps you stop picking assignees if that assignee doesn't work on it for half a year?
Updated by SLindoMansilla about 5 years ago
- Status changed from Blocked to New
It is not a team bug, it is a team feature.
An assignee to a blocked ticket doesn't mean that he will work on it, but that he will track the blocker and update the ticket properly ;)
Blocker rejected: #46919
Updated by coolo about 5 years ago
If it's a team feature, perhaps it shouldn't be in Category 'Concrete Bugs'?
Updated by mkittler about 5 years ago
And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).
Updated by SLindoMansilla about 5 years ago
mkittler wrote:
And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).
It is not blocked, the status is "new" now.
If you refer to the relation, it is because it was blocked. Removing the relation would remove information. A relation of blocked by with a closed ticket (see strikethrough) is not considered by U-Team as blocked anymore.
Updated by mgriessmeier almost 5 years ago
- Target version changed from Milestone 28 to Milestone 30
needs to be discussed offline
Updated by mkittler almost 5 years ago
Is it still happening? If that's the case, is the log messag flushing frames
printed?
Updated by mgriessmeier almost 5 years ago
- Status changed from Blocked to In Progress
haven't seen for a long time, closing for now
Updated by mgriessmeier almost 5 years ago
- Subject changed from [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes to [tools] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
- Status changed from In Progress to New
- Assignee deleted (
mgriessmeier) - Priority changed from High to Normal
or at least unassigning, removing from [u] team and lowering prio
Updated by okurz almost 5 years ago
- Blocked by deleted (action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out")
Updated by okurz almost 5 years ago
- Related to action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Updated by okurz almost 5 years ago
- Subject changed from [tools] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes to Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
- Status changed from New to Rejected
- Assignee set to okurz
If the ACs are not fulfilled I guess this can happen anytime again but if you are not affected by this I guess not anyone else is for now.