Project

General

Profile

Actions

action #49961

closed

Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes

Added by SLindoMansilla over 5 years ago. Updated almost 5 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
SUSE QA (private) - Milestone 30
Start date:
2019-04-03
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

  • AC: Prevent svirt backend to hang on virsh undefine command, e.g. reduce the timeout for this command.

Observation


[2019-04-02T18:12:36.938 CEST] [debug] Command executed: virsh  destroy openQA-SUT-1, ret=0
[2019-04-02T18:12:37.090 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-04-02T18:12:37.182 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 has been undefined


[2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh  undefine --snapshots-metadata openQA-SUT-1, ret=0 <<<<<<<< apparently it hangs after attempting to execute this command
[2019-04-02T19:39:01.274 CEST] [debug] backend got TERM
[2019-04-02T19:39:01.275 CEST] [debug] Destroying openQA-SUT-1 virtual machine
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":60535"
      after 15744 requests (15744 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":57603"
      after 15740 requests (15583 known processed) with 0 events remaining.
[2019-04-02T19:39:01.290 CEST] [debug] signalhandler got TERM
[2019-04-02T19:39:01.290 CEST] [debug] killing backend process 657476

Reproducible

Fails since Build 204.1
in scenario sle-15-SP1-Installer-DVD-s390x-gnome+proxy_SCC+allmodules@s390x-kvm-sle12

Latest builds are working again.

Expected result

Last good: 199.1

Further details

Always latest result in this scenario: latest


Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #50225: Make JOB_TIMEOUT incompletes more obviousResolvedmkittler2019-04-10

Actions
Related to openQA Tests (public) - action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out"Rejectedszarate2018-12-01

Actions
Actions #1

Updated by mgriessmeier over 5 years ago

  • Project changed from openQA Tests (public) to openQA Project (public)
  • Subject changed from [functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout
  • Description updated (diff)
  • Category changed from Bugs in existing tests to Regressions/Crashes
  • Status changed from New to Workable
  • Target version set to Milestone 24
Actions #2

Updated by szarate over 5 years ago

I would suggest that we can create a needle that would try to match the current screen: https://openqa.suse.de/tests/2843000#step/hostname_inst/5, if it matches (except for the ip address), mark it as a failed test, with a reference to either a progress ticket or a bsc. That should be good enogh for the first run

Actions #3

Updated by zluo over 5 years ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

take over and check.

Actions #4

Updated by mkittler over 5 years ago

Note that the message

[2019-04-02T18:12:34.094 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.

does not mean the test timed out after reaching the configured timeout of usually 2 hours.

It just means that isotovideo could not tell the websocket clients that the test execution ended. This is required for the developer mode to suppress irritating error messages. This debug message has nothing to do with the test execution itself. It is printed after the test execution has already ended.

I'm not sure why it fails, though. But it looks like it actually just took more than 15 seconds to distribute that message to the openqa-livehandler daemon.

Actions #5

Updated by mkittler over 5 years ago

Looking at the job history there are indeed timed-out jobs. But still, the error message stated under 'Observation' has nothing to do with these timeouts.

It seems that the test execution time has been reduced in more recent testruns (e.g. https://openqa.suse.de/tests/2862771).

Actions #6

Updated by zluo over 5 years ago

compare
normal test run takes less than 60 minutes:
https://openqa.suse.de/tests/2862771/file/video.ogv at 3:29

with incomplete test run takes about 3 hours:
https://openqa.suse.de/tests/2763448/file/video.ogv at end 5:00
running test code from apachetest.pm

I found the performance from incomplete test is poor, but I don't think it takes really exact 3:00 hours.
If SUT is not done with test, then It should stop kvm guest for running further test modules, but this is failed which mentioned in

isotovideo: unable to inform websocket clients about stopping command server

Actions #7

Updated by zluo over 5 years ago

so what I can say: all tests till consoletest_finish runs completely and successfully.
We have for this case only a problem for: isotovideo send command to end running tests on SUT after 3:00 hours, but SUT is overloaded somehow and it cannot stop working and it continues running till end of tests.

Actions #9

Updated by mkittler over 5 years ago

@zluo

If SUT is not done with test, then It should stop kvm guest for running further test modules, but this is failed which mentioned in

The message 'isotovideo: unable to inform websocket clients about stopping command server' and the problem that the SUT continues to run are not related.

Actions #10

Updated by zluo over 5 years ago

over 100 test runs on osd don't show this issue.
2 failures: ssc_registration (executed by grenache-1:13) and reconnect_mgmt_console (executed by grenache-1:14) which are not related with this issue.

Actions #11

Updated by zluo over 5 years ago

compare successful test runs:
https://openqa.suse.de/tests/2867385/file/autoinst-log.txt

[2019-05-07T10:59:08.462 CEST] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/consoletest_finish.pm:40 called testapi::wait_still_screen
[2019-05-07T10:59:08.462 CEST] [debug] <<< testapi::wait_still_screen(timeout=30, stilltime=2, similarity_level=47)
[2019-05-07T10:59:10.502 CEST] [debug] >>> testapi::wait_still_screen: detected same image for 2 seconds, last detected similarity is 1000000
[2019-05-07T10:59:10.729 CEST] [debug] ||| finished consoletest_finish console at 2019-05-07 08:59:10 (13 s)
[2019-05-07T10:59:10.730 CEST] [debug] killing autotest process 333213
[2019-05-07T10:59:10.733 CEST] [debug] [autotest] process exited: 0
[2019-05-07T10:59:11.734 CEST] [debug] done with autotest process
[2019-05-07T10:59:11.734 CEST] [debug] killing command server 333212 because test execution ended
[2019-05-07T10:59:11.734 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/heeley63kbisPxst/broadcast
[2019-05-07T10:59:12.744 CEST] [debug] commands process exited: 0
[2019-05-07T10:59:13.745 CEST] [debug] done with command server
[2019-05-07T10:59:13.745 CEST] [debug] isotovideo done
[2019-05-07T10:59:13.843 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:13.843 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x100205e8c00)='! virsh  dominfo openQA-SUT-1 | grep -w \'shut off\'', hostname='s390p8.suse.de', password='.zKVM.')
[2019-05-07T10:59:13.931 CEST] [debug] Command executed: '! virsh  dominfo openQA-SUT-1 | grep -w 'shut off'', ret=0
[2019-05-07T10:59:13.943 CEST] [debug] BACKEND SHUTDOWN 0
[2019-05-07T10:59:13.944 CEST] [debug] Destroying openQA-SUT-1 virtual machine
[2019-05-07T10:59:14.030 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:14.030 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x1003dc3d870)='virsh destroy openQA-SUT-1', password='.zKVM.', hostname='s390p8.suse.de')
[2019-05-07T10:59:14.562 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 destroyed

with

https://openqa.suse.de/tests/2763448/file/worker-log.txt

[2019-04-02T19:39:00.0200 CEST] [warn] [pid:542426] max job time exceeded, aborting 02763448-sle-15-SP1-Installer-DVD-s390x-Build204.1-gnome+proxy_SCC+allmodules@s390x-kvm-sle12
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] stop_job timeout
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer update_status
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer job_timeout
[2019-04-02T19:39:00.0201 CEST] [debug] [pid:542426] POST http://openqa.suse.de/api/v1/jobs/2763448/status
[2019-04-02T19:39:00.0272 CEST] [info] [pid:542426] trying to stop job gracefully by announcing it to command server via http://localhost:20123/FLA9DxoQcuJX3vgy/broadcast
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] unable to stop the command server gracefully: 
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] command server likely not reachable at all
[2019-04-02T19:39:09.0054 CEST] [info] [pid:542426] uploading logs_from_installation_system-y2logs.tar.bz2

we can see only difference here is something wrong with command server which is not reachable.

Actions #12

Updated by zluo over 5 years ago

https://openqa.suse.de/tests/2763448/file/autoinst-log.txt

started with time [2019-04-02T16:38:49.0684 CEST] till [2019-04-02T18:12:19.078 CEST] it took about 1 hour 34 minutes to get all tests finished.

in worker.log we see
[2019-04-02T19:39:00.0200 CEST] max job time exceeded

Actions #13

Updated by mkittler over 5 years ago

  • Priority changed from High to Normal

After a closer look it seems like the following happens:

  1. All test modules are executed.
  2. isotovideo tries to inform web socket clients that the command server that it is going to terminate. That this does not work which might indicate that the command server is too busy at that time. However, it does not cause everything to hang. Instead ...
  3. ... isotovideo attempts to shutdown the backend. This is where the timeout is exceeded. The last line logged by the backend is [2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh undefine --snapshots-metadata openQA-SUT-1, ret=0. So apparently that command or some command after it (which is not logged) hangs.
  4. The worker kills everything after the timeout happened.

The command mentioned under 3. is ran via SSH by the svirt backend. So either virsh undefine itself hang or the SSH connection (not sure what the timeout for SSH is).

Since the latest jobs doesn't seem to run into the problem anymore I set the priority to normal.

@zulo So this is an issue within the svirt backend. You can unassign yourself if you like somebody else to take over.

Actions #14

Updated by mkittler over 5 years ago

  • Related to action #50225: Make JOB_TIMEOUT incompletes more obvious added
Actions #15

Updated by zluo over 5 years ago

  • Status changed from In Progress to Rejected

unassign myself at first.

Actions #16

Updated by zluo over 5 years ago

  • Status changed from Rejected to Workable
  • Assignee deleted (zluo)

reject this ticket since #50225 will handle this issue.

Actions #17

Updated by mkittler over 5 years ago

  • Subject changed from [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
  • Description updated (diff)

I adjusted the ticket description and AC because the os-autoinst log actually allows to figure out what's taking so long. Just look at the timestamps beside the log messages. The real problem is that the svirt backend hangs. The log messages stated under observation were also irrelevant.

If the tools team should take over you can change the target version.

Actions #18

Updated by mkittler over 5 years ago

  • Description updated (diff)
Actions #19

Updated by mkittler over 5 years ago

The virsh command itself does not hang because Command executed is really printed after the command is executed.

This leaves only the SSH connection. After Command executed the SSH channel is attempted to be closed. Then it might download a file via SCP if open_serial_console_via_ssh has been called previously. But that's actually not the case in the failed job. Not sure whether closing the SSH channel would cause this timeout. But we actullay don't set any timeout for the SSH connection so the default of having no timeout applies (https://metacpan.org/pod/Net::SSH2#timeout-(-timeout_ms-)).

Than it is 'flushing frames'. Maybe it hangs there. It could be the case because that involves blocking IPC with the video encoder. But then it likely wouldn't be an svirt backend-specific issue.

I guess we have to add more debug printing to find out where it actually hangs and wait until it happens again.

Actions #20

Updated by SLindoMansilla over 5 years ago

  • Status changed from Workable to New
  • Priority changed from Normal to High

For next grooming

Actions #21

Updated by SLindoMansilla over 5 years ago

  • Related to action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Actions #22

Updated by SLindoMansilla over 5 years ago

  • Related to deleted (action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out")
Actions #23

Updated by SLindoMansilla over 5 years ago

  • Blocked by action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Actions #24

Updated by SLindoMansilla over 5 years ago

  • Status changed from New to Blocked

Blocked by #46919

Actions #25

Updated by mkittler over 5 years ago

PR for debugging whether it gets stuck before or after 'flushing frames': https://github.com/os-autoinst/os-autoinst/pull/1148

I'm not that sure why this is blocked by "Connection timed out".

Actions #26

Updated by mgriessmeier over 5 years ago

  • Target version changed from Milestone 24 to Milestone 25

move to M25

Actions #27

Updated by SLindoMansilla over 5 years ago

  • Assignee set to mgriessmeier

As discussed in the grooming meeting

Actions #28

Updated by mgriessmeier over 5 years ago

  • Target version changed from Milestone 25 to Milestone 26
Actions #29

Updated by mgriessmeier over 5 years ago

  • Target version changed from Milestone 26 to Milestone 28
Actions #30

Updated by coolo about 5 years ago

Perhaps you stop picking assignees if that assignee doesn't work on it for half a year?

Actions #31

Updated by SLindoMansilla about 5 years ago

  • Status changed from Blocked to New

It is not a team bug, it is a team feature.
An assignee to a blocked ticket doesn't mean that he will work on it, but that he will track the blocker and update the ticket properly ;)

Blocker rejected: #46919

Actions #32

Updated by coolo about 5 years ago

If it's a team feature, perhaps it shouldn't be in Category 'Concrete Bugs'?

Actions #33

Updated by mkittler about 5 years ago

And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).

Actions #34

Updated by SLindoMansilla about 5 years ago

mkittler wrote:

And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).

It is not blocked, the status is "new" now.
If you refer to the relation, it is because it was blocked. Removing the relation would remove information. A relation of blocked by with a closed ticket (see strikethrough) is not considered by U-Team as blocked anymore.

Actions #35

Updated by szarate about 5 years ago

  • Status changed from New to Blocked
Actions #36

Updated by mgriessmeier almost 5 years ago

  • Target version changed from Milestone 28 to Milestone 30

needs to be discussed offline

Actions #37

Updated by mkittler almost 5 years ago

Is it still happening? If that's the case, is the log messag flushing frames printed?

Actions #38

Updated by mgriessmeier almost 5 years ago

  • Status changed from Blocked to In Progress

haven't seen for a long time, closing for now

Actions #39

Updated by mgriessmeier almost 5 years ago

  • Subject changed from [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes to [tools] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
  • Status changed from In Progress to New
  • Assignee deleted (mgriessmeier)
  • Priority changed from High to Normal

or at least unassigning, removing from [u] team and lowering prio

Actions #40

Updated by okurz almost 5 years ago

  • Blocked by deleted (action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out")
Actions #41

Updated by okurz almost 5 years ago

  • Related to action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added
Actions #42

Updated by okurz almost 5 years ago

  • Subject changed from [tools] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes to Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
  • Status changed from New to Rejected
  • Assignee set to okurz

If the ACs are not fulfilled I guess this can happen anytime again but if you are not affected by this I guess not anyone else is for now.

Actions

Also available in: Atom PDF