action #49961

[tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes

Added by SLindoMansilla 11 months ago. Updated about 1 month ago.

Status: Blocked
Start date: 03/04/2019
Priority: High
Due date:
Assignee: mgriessmeier
% Done: 0%
Category: Concrete Bugs
Target version: SUSE QA tests - Milestone 30
Difficulty:
Duration:

Description

Acceptance criteria

  • AC: Prevent the svirt backend from hanging on the virsh undefine command, e.g. reduce the timeout for this command.
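One way to satisfy the AC would be to bound the command with an explicit timeout on the caller's side, so a hung virsh undefine fails fast instead of eating the whole job time. A minimal sketch in Python (the backend itself is Perl; the function name, the timeout value, and the use of a local subprocess instead of the real SSH channel are illustrative assumptions, not the backend's actual code):

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run a command, but give up after `timeout` seconds instead of
    hanging forever - the behaviour the AC asks for on virsh undefine."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # the hung command is killed, so the job can fail fast
        return None, ""

# illustrative stand-in for:
#   virsh undefine --snapshots-metadata openQA-SUT-1
rc, out = run_with_timeout(["sleep", "5"], timeout=1)
```

With such a wrapper a hanging undefine would surface as a backend error within seconds rather than triggering the 2-hour max job time.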

Observation


[2019-04-02T18:12:36.938 CEST] [debug] Command executed: virsh  destroy openQA-SUT-1, ret=0
[2019-04-02T18:12:37.090 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-04-02T18:12:37.182 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 has been undefined


[2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh  undefine --snapshots-metadata openQA-SUT-1, ret=0 <<<<<<<< apparently it hangs after attempting to execute this command
[2019-04-02T19:39:01.274 CEST] [debug] backend got TERM
[2019-04-02T19:39:01.275 CEST] [debug] Destroying openQA-SUT-1 virtual machine
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":60535"
      after 15744 requests (15744 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":57603"
      after 15740 requests (15583 known processed) with 0 events remaining.
[2019-04-02T19:39:01.290 CEST] [debug] signalhandler got TERM
[2019-04-02T19:39:01.290 CEST] [debug] killing backend process 657476

Reproducible

Fails since Build 204.1
in scenario sle-15-SP1-Installer-DVD-s390x-gnome+proxy_SCC+allmodules@s390x-kvm-sle12

Latest builds are working again.

Expected result

Last good: 199.1

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Project - action #50225: Make JOB_TIMEOUT incompletes more obvious Resolved 10/04/2019
Blocked by openQA Tests - action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket:... New 01/12/2018

History

#1 Updated by mgriessmeier 10 months ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from [functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout
  • Description updated (diff)
  • Category changed from Bugs in existing tests to Concrete Bugs
  • Status changed from New to Workable
  • Target version set to Milestone 24

#2 Updated by szarate 10 months ago

I would suggest creating a needle that tries to match the current screen: https://openqa.suse.de/tests/2843000#step/hostname_inst/5. If it matches (except for the IP address), mark it as a failed test with a reference to either a progress ticket or a bsc. That should be good enough for a first run.

#3 Updated by zluo 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

take over and check.

#4 Updated by mkittler 10 months ago

Note that the message

[2019-04-02T18:12:34.094 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.

does not mean the test timed out after reaching the configured timeout of usually 2 hours.

It just means that isotovideo could not tell the websocket clients that the test execution ended. This is required for the developer mode to suppress irritating error messages. This debug message has nothing to do with the test execution itself. It is printed after the test execution has already ended.

I'm not sure why it fails, though. But it looks like it actually just took more than 15 seconds to distribute that message to the openqa-livehandler daemon.

#5 Updated by mkittler 10 months ago

Looking at the job history there are indeed timed-out jobs. But still, the error message stated under 'Observation' has nothing to do with these timeouts.

It seems that the test execution time has been reduced in more recent testruns (e.g. https://openqa.suse.de/tests/2862771).

#6 Updated by zluo 10 months ago

Compare a normal test run, which takes less than 60 minutes:
https://openqa.suse.de/tests/2862771/file/video.ogv at 3:29

with an incomplete test run, which takes about 3 hours:
https://openqa.suse.de/tests/2763448/file/video.ogv at end 5:00
running test code from apachetest.pm

The performance in the incomplete test is poor, but I don't think it really takes exactly 3:00 hours.
If the SUT is not done with the test, it should stop the KVM guest instead of running further test modules, but this fails, as mentioned in

isotovideo: unable to inform websocket clients about stopping command server

#7 Updated by zluo 10 months ago

What I can say: all tests up to consoletest_finish run completely and successfully.
In this case the only problem is that isotovideo sends the command to end the running tests on the SUT after 3:00 hours, but the SUT is somehow overloaded; it cannot stop and continues running until the end of the tests.

#9 Updated by mkittler 10 months ago

@zluo

If the SUT is not done with the test, it should stop the KVM guest instead of running further test modules, but this fails, as mentioned in

The message 'isotovideo: unable to inform websocket clients about stopping command server' and the problem that the SUT continues to run are not related.

#10 Updated by zluo 10 months ago

Over 100 test runs on osd don't show this issue.
There were 2 failures: ssc_registration (executed by grenache-1:13) and reconnect_mgmt_console (executed by grenache-1:14), which are not related to this issue.

#11 Updated by zluo 10 months ago

Compare a successful test run:
https://openqa.suse.de/tests/2867385/file/autoinst-log.txt

[2019-05-07T10:59:08.462 CEST] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/consoletest_finish.pm:40 called testapi::wait_still_screen
[2019-05-07T10:59:08.462 CEST] [debug] <<< testapi::wait_still_screen(timeout=30, stilltime=2, similarity_level=47)
[2019-05-07T10:59:10.502 CEST] [debug] >>> testapi::wait_still_screen: detected same image for 2 seconds, last detected similarity is 1000000
[2019-05-07T10:59:10.729 CEST] [debug] ||| finished consoletest_finish console at 2019-05-07 08:59:10 (13 s)
[2019-05-07T10:59:10.730 CEST] [debug] killing autotest process 333213
[2019-05-07T10:59:10.733 CEST] [debug] [autotest] process exited: 0
[2019-05-07T10:59:11.734 CEST] [debug] done with autotest process
[2019-05-07T10:59:11.734 CEST] [debug] killing command server 333212 because test execution ended
[2019-05-07T10:59:11.734 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/heeley63kbisPxst/broadcast
[2019-05-07T10:59:12.744 CEST] [debug] commands process exited: 0
[2019-05-07T10:59:13.745 CEST] [debug] done with command server
[2019-05-07T10:59:13.745 CEST] [debug] isotovideo done
[2019-05-07T10:59:13.843 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:13.843 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x100205e8c00)='! virsh  dominfo openQA-SUT-1 | grep -w \'shut off\'', hostname='s390p8.suse.de', password='.zKVM.')
[2019-05-07T10:59:13.931 CEST] [debug] Command executed: '! virsh  dominfo openQA-SUT-1 | grep -w 'shut off'', ret=0
[2019-05-07T10:59:13.943 CEST] [debug] BACKEND SHUTDOWN 0
[2019-05-07T10:59:13.944 CEST] [debug] Destroying openQA-SUT-1 virtual machine
[2019-05-07T10:59:14.030 CEST] [debug] Connection to root@s390p8.suse.de established
[2019-05-07T10:59:14.030 CEST] [debug] <<< backend::svirt::run_cmd(Net::SSH2=SCALAR(0x1003dc3d870)='virsh destroy openQA-SUT-1', password='.zKVM.', hostname='s390p8.suse.de')
[2019-05-07T10:59:14.562 CEST] [debug] Command's stdout:
Domain openQA-SUT-1 destroyed

with

https://openqa.suse.de/tests/2763448/file/worker-log.txt

[2019-04-02T19:39:00.0200 CEST] [warn] [pid:542426] max job time exceeded, aborting 02763448-sle-15-SP1-Installer-DVD-s390x-Build204.1-gnome+proxy_SCC+allmodules@s390x-kvm-sle12
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] stop_job timeout
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer update_status
[2019-04-02T19:39:00.0200 CEST] [debug] [pid:542426] ## removing timer job_timeout
[2019-04-02T19:39:00.0201 CEST] [debug] [pid:542426] POST http://openqa.suse.de/api/v1/jobs/2763448/status
[2019-04-02T19:39:00.0272 CEST] [info] [pid:542426] trying to stop job gracefully by announcing it to command server via http://localhost:20123/FLA9DxoQcuJX3vgy/broadcast
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] unable to stop the command server gracefully: 
[2019-04-02T19:39:00.0273 CEST] [error] [pid:542426] command server likely not reachable at all
[2019-04-02T19:39:09.0054 CEST] [info] [pid:542426] uploading logs_from_installation_system-y2logs.tar.bz2

The only difference we can see here is that something is wrong with the command server, which is not reachable at all.

#12 Updated by zluo 10 months ago

https://openqa.suse.de/tests/2763448/file/autoinst-log.txt

The run started at [2019-04-02T16:38:49.0684 CEST] and all tests finished at [2019-04-02T18:12:19.078 CEST], so it took about 1 hour 34 minutes to get all tests finished.

In worker-log.txt we then see
[2019-04-02T19:39:00.0200 CEST] max job time exceeded

#13 Updated by mkittler 10 months ago

  • Priority changed from High to Normal

After a closer look it seems like the following happens:

  1. All test modules are executed.
  2. isotovideo tries to inform the websocket clients that the command server is going to terminate. This does not work, which might indicate that the command server is too busy at that time. However, it does not cause everything to hang. Instead ...
  3. ... isotovideo attempts to shutdown the backend. This is where the timeout is exceeded. The last line logged by the backend is [2019-04-02T18:12:37.182 CEST] [debug] Command executed: virsh undefine --snapshots-metadata openQA-SUT-1, ret=0. So apparently that command or some command after it (which is not logged) hangs.
  4. The worker kills everything after the timeout happened.

The command mentioned under 3. is run via SSH by the svirt backend. So either virsh undefine itself hangs or the SSH connection does (not sure what the timeout for SSH is).

Since the latest jobs don't seem to run into the problem anymore, I set the priority to normal.

@zluo So this is an issue within the svirt backend. You can unassign yourself if you would like somebody else to take over.
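Step 4 of the sequence above amounts to a watchdog: the worker lets isotovideo run and, once the max job time is exceeded, terminates the whole thing. A hypothetical sketch of that supervision step (names, the timeout value, and the sleep stand-in are illustrative assumptions, not the worker's actual code):

```python
import subprocess

def supervise(cmd, max_job_time):
    """Start a child process and kill it once max_job_time is exceeded,
    mirroring step 4: the worker kills everything after the timeout."""
    child = subprocess.Popen(cmd)
    try:
        return child.wait(timeout=max_job_time)  # normal completion
    except subprocess.TimeoutExpired:
        child.terminate()                        # send TERM, as the worker does
        child.wait()                             # reap the killed child
        return None                              # signals: job timed out

# illustrative stand-in for a backend process stuck in virsh undefine
result = supervise(["sleep", "5"], max_job_time=0.5)
```

This is why the job ends as a timeout/incomplete rather than hanging forever: the hang is only ever resolved from the outside, which is exactly what the AC wants to avoid by timing out earlier inside the backend.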

#14 Updated by mkittler 10 months ago

  • Related to action #50225: Make JOB_TIMEOUT incompletes more obvious added

#15 Updated by zluo 10 months ago

  • Status changed from In Progress to Rejected

unassign myself at first.

#16 Updated by zluo 10 months ago

  • Status changed from Rejected to Workable
  • Assignee deleted (zluo)

reject this ticket since #50225 will handle this issue.

#17 Updated by mkittler 10 months ago

  • Subject changed from [tools][functional][u] test ends incomplete - os-autoinst finishes with result: timeout to [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes
  • Description updated (diff)

I adjusted the ticket description and AC because the os-autoinst log actually makes it possible to figure out what's taking so long: just look at the timestamps beside the log messages. The real problem is that the svirt backend hangs. The log messages stated under 'Observation' were also irrelevant.

If the tools team should take over, you can change the target version.

#18 Updated by mkittler 10 months ago

  • Description updated (diff)

#19 Updated by mkittler 10 months ago

The virsh command itself does not hang because 'Command executed' is only printed after the command has actually been executed.

This leaves only the SSH connection. After 'Command executed' the SSH channel is attempted to be closed. Then a file might be downloaded via SCP if open_serial_console_via_ssh had been called previously, but that's actually not the case in the failed job. I'm not sure whether closing the SSH channel would cause this timeout. But we actually don't set any timeout for the SSH connection, so the default of having no timeout applies (https://metacpan.org/pod/Net::SSH2#timeout-(-timeout_ms-)).

Then comes 'flushing frames'. Maybe it hangs there. That could be the case because it involves blocking IPC with the video encoder. But then it likely wouldn't be an svirt-backend-specific issue.

I guess we have to add more debug printing to find out where it actually hangs and wait until it happens again.
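The "no timeout by default" behaviour mentioned above is easy to reproduce with a plain socket: without an explicit timeout a read on a silent peer blocks indefinitely, while setting one turns the block into an error the backend could handle. A small Python illustration (Net::SSH2 ultimately sits on such a socket; this is a generic sketch, not the backend's actual code):

```python
import socket

# A connected socket pair; one side never sends anything,
# emulating a hung remote end of the SSH connection.
reader, writer = socket.socketpair()

# Default: no timeout at all, so reader.recv(1) would block forever here.
assert reader.gettimeout() is None

# With an explicit timeout the blocked read fails fast instead.
reader.settimeout(0.2)
try:
    reader.recv(1)
    hung = True   # never reached: nothing is ever sent
except socket.timeout:
    hung = False  # the read gave up after 0.2 s

reader.close()
writer.close()
```

Setting the equivalent timeout on the Net::SSH2 handle would likewise convert a silent hang into a loggable backend error.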

#20 Updated by SLindoMansilla 10 months ago

  • Status changed from Workable to New
  • Priority changed from Normal to High

For next grooming

#21 Updated by SLindoMansilla 10 months ago

  • Related to action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added

#22 Updated by SLindoMansilla 10 months ago

  • Related to deleted (action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out")

#23 Updated by SLindoMansilla 10 months ago

  • Blocked by action #46919: [functional][u][svirt][sporadic] auto_review:"IO::Socket::INET: connect: Connection timed out" added

#24 Updated by SLindoMansilla 10 months ago

  • Status changed from New to Blocked

Blocked by #46919

#25 Updated by mkittler 10 months ago

PR for debugging whether it gets stuck before or after 'flushing frames': https://github.com/os-autoinst/os-autoinst/pull/1148

I'm not that sure why this is blocked by "Connection timed out".

#26 Updated by mgriessmeier 9 months ago

  • Target version changed from Milestone 24 to Milestone 25

move to M25

#27 Updated by SLindoMansilla 9 months ago

  • Assignee set to mgriessmeier

As discussed in the grooming meeting

#28 Updated by mgriessmeier 8 months ago

  • Target version changed from Milestone 25 to Milestone 26

#29 Updated by mgriessmeier 6 months ago

  • Target version changed from Milestone 26 to Milestone 28

#30 Updated by coolo 4 months ago

Perhaps you stop picking assignees if that assignee doesn't work on it for half a year?

#31 Updated by SLindoMansilla 4 months ago

  • Status changed from Blocked to New

It is not a team bug, it is a team feature.
Being assigned to a blocked ticket doesn't mean the assignee will work on it, but that they will track the blocker and update the ticket properly ;)

Blocker rejected: #46919

#32 Updated by coolo 4 months ago

If it's a team feature, perhaps it shouldn't be in Category 'Concrete Bugs'?

#33 Updated by mkittler 4 months ago

And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).

#34 Updated by SLindoMansilla 4 months ago

mkittler wrote:

And I'm still wondering why this is blocked by "Connection timed out" (maybe the tickets are just related?).

It is not blocked, the status is "new" now.
It is not blocked, the status is "new" now.
If you refer to the relation, it is because it was blocked. Removing the relation would remove information. A "blocked by" relation to a closed ticket (see strikethrough) is not considered by the U-Team as blocking anymore.

#35 Updated by szarate 3 months ago

  • Status changed from New to Blocked

#36 Updated by mgriessmeier about 1 month ago

  • Target version changed from Milestone 28 to Milestone 30

needs to be discussed offline

#37 Updated by mkittler about 1 month ago

Is it still happening? If that's the case, is the log message 'flushing frames' printed?
