action #16616

closed

ppc64le tests die/timeout while saving snapshot

Added by rpalethorpe about 7 years ago. Updated over 6 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: Feature requests
Target version: -
Start date: 2017-02-09
Due date:
% Done: 0%
Estimated time:

Description

In the following case, the log clearly shows that the test timed out while waiting for a response from QEMU. In other cases it is not clear to me why the test dies, but it seems to happen at the same point (where a snapshot is saved). I thought there would be an existing ticket for this, but could not find one.

https://openqa.suse.de/tests/762741

Hypotheses

  • H1: Saving the snapshot takes too long and the command times out, although it would complete if given enough time.
  • H2: QEMU crashes.
  • H3: The storage is unreachable or broken.
  • H4: os-autoinst misreads the socket.

H1 seems the most likely by far.

Potential Actions

  • A1: Increase the timeout.
  • A2: Increase the storage or compression performance.
  • A3: Stress test openQA to reproduce the bug and investigate further.

A1 is the easiest; A2 and A3 may be more worthwhile, but are probably too difficult for now.
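
A1 amounts to giving the monitor read a longer deadline. The sketch below is hypothetical Python, not os-autoinst's (Perl) code: it assumes each reply on the QEMU human monitor (HMP) socket ends with the `(qemu) ` prompt, and `read_hmp_reply` and `HmpTimeout` are illustrative names. It enforces one overall deadline rather than a per-`recv` timeout, so a reply that trickles in slowly still cannot exceed `timeout_s` in total.

```python
import socket
import time

# Assumption: QEMU's human monitor (HMP) prints this prompt after each reply.
PROMPT = b"(qemu) "


class HmpTimeout(Exception):
    """Hypothetical error: no complete reply arrived before the deadline."""


def read_hmp_reply(sock: socket.socket, timeout_s: float) -> bytes:
    """Read one HMP reply, enforcing a single overall deadline of timeout_s."""
    deadline = time.monotonic() + timeout_s
    buf = b""
    while not buf.endswith(PROMPT):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise HmpTimeout(f"no HMP prompt within {timeout_s} s")
        # Each recv may only wait for whatever is left of the overall budget.
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            raise HmpTimeout(f"no HMP prompt within {timeout_s} s") from None
        if not chunk:
            # Peer closed the socket: report this separately from a timeout.
            raise EOFError("HMP socket closed before the prompt arrived")
        buf += chunk
    return buf[: -len(PROMPT)]
```

Under these assumptions, A1 only changes the `timeout_s` passed in; keeping the EOF case as a separate exception means a closed socket is never reported as "timeout reading hmp socket".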

Workarounds

Simply restart the test manually.

#1

Updated by coolo about 7 years ago

We need SSDs for all our production servers :(

#2

Updated by okurz about 7 years ago

I agree with A1, if that can be done.

#3

Updated by okurz about 7 years ago

  • Category set to Feature requests

OK, we can prevent this from happening by putting SSDs in all production servers or by bumping up the timeout, but I think the feedback to the user in case of an error can be improved so that we can distinguish between H1 to H4.
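
One way to improve that feedback is to record what was actually observed when the monitor read failed and map it onto the hypotheses. This is a minimal sketch, assuming a Unix host and a backend that knows QEMU's pid; `Suspect`, `qemu_has_exited` and `classify_hmp_failure` are hypothetical names, not part of os-autoinst. H1 and H3 cannot be told apart from the socket alone (both look like a live QEMU that never answers), so they share a label.

```python
import os
from enum import Enum


class Suspect(Enum):
    """Which hypothesis from the ticket an HMP read failure points to."""
    SLOW_SAVE_OR_STORAGE = "H1 or H3"  # QEMU alive, reply never arrived
    QEMU_CRASHED = "H2"                 # QEMU exited or closed the monitor
    MISREAD = "H4"                      # reply arrived but was not understood


def qemu_has_exited(pid: int) -> bool:
    """Non-blocking check of a child QEMU process (Unix only)."""
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # Already reaped, or not our child; simplification: treat as gone.
        return True
    # With WNOHANG, a returned pid of 0 means the child is still running.
    return reaped_pid != 0


def classify_hmp_failure(timed_out: bool, got_eof: bool, qemu_exited: bool) -> Suspect:
    """Map what was observed during a failed HMP read onto H1-H4."""
    if qemu_exited or got_eof:
        return Suspect.QEMU_CRASHED
    if timed_out:
        return Suspect.SLOW_SAVE_OR_STORAGE
    return Suspect.MISREAD
```

If the repeated "waitpid for 13780 returned 0" lines in the log below come from a non-blocking waitpid, QEMU was still running after the timeout, and this scheme would report "H1 or H3" rather than H2.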

#4

Updated by okurz about 7 years ago

Just found a similar one on openqaworker1.opensuse.org (the same failure on two worker processes):

22:23:40.5673 13774 Creating a VM snapshot lastgood
DIE ERROR: timeout reading hmp socket

 at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
        backend::baseclass::die_handler('ERROR: timeout reading hmp socket\x{a}') called at /usr/lib/os-autoinst/backend/qemu.pm line 903
        backend::qemu::_read_hmp('backend::qemu=HASH(0x6d7b0e0)') called at /usr/lib/os-autoinst/backend/qemu.pm line 971
        backend::qemu::_send_hmp('backend::qemu=HASH(0x6d7b0e0)', 'savevm lastgood') called at /usr/lib/os-autoinst/backend/qemu.pm line 207
        backend::qemu::save_snapshot('backend::qemu=HASH(0x6d7b0e0)', 'HASH(0x74b38d0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 68
        backend::baseclass::handle_command('backend::qemu=HASH(0x6d7b0e0)', 'HASH(0x74b2cf0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 422
        backend::baseclass::check_socket('backend::qemu=HASH(0x6d7b0e0)', 'IO::Handle=GLOB(0x6f286f8)') called at /usr/lib/os-autoinst/backend/qemu.pm line 998
        backend::qemu::check_socket('backend::qemu=HASH(0x6d7b0e0)', 'IO::Handle=GLOB(0x6f286f8)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 203
        eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 151
        backend::baseclass::run_capture_loop('backend::qemu=HASH(0x6d7b0e0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 122
        backend::baseclass::run('backend::qemu=HASH(0x6d7b0e0)', 6, 9) called at /usr/lib/os-autoinst/backend/driver.pm line 85
        backend::driver::start('backend::driver=HASH(0x58e0830)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
        backend::driver::new('backend::driver', 'qemu') called at /usr/bin/isotovideo line 197
        main::init_backend() called at /usr/bin/isotovideo line 268
22:28:41.0044 13775 waitpid for 13780 returned 0
22:28:41.0044 13775 sending TERM to qemu pid: 13780
22:28:41.2503 13766 signalhandler got TERM - loop 1
22:28:41.2505 13766 awaiting death of commands process
22:28:41.2531 13766 commands process exited: 13768
22:28:41.2532 13766 awaiting death of testpid 13774
22:28:41.2563 13766 test process exited: 13774
22:28:41.2563 13766 isotovideo failed
22:28:41.2570 13766 killing backend process 13775
22:28:41.2571 13775 backend got TERM
22:28:41.2572 13775 waitpid for 13780 returned 0
22:28:41.2573 13775 sending TERM to qemu pid: 13780
22:28:42.2574 13775 waitpid for 13780 returned 0
22:28:43.2577 13775 waitpid for 13780 returned 0
22:28:44.2579 13775 waitpid for 13780 returned 0
22:28:45.2583 13775 waitpid for 13780 returned 0
22:28:46.2585 13775 waitpid for 13780 returned 0
last frame
22:31:14.7803 13775 sending magic and exit
22:31:14.7988 13766 done with backend process
13766: EXIT 1
+++ worker notes +++
end time: 2017-02-28 22:31:14
result: crashed
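
The timestamps put a number on the wait. The DIE line itself carries no timestamp, so this minimal sketch measures from the snapshot start to the first timestamped line after the error. `log_gap_seconds` is a hypothetical helper, and it assumes both lines start with `HH:MM:SS.ffff` and fall on the same day.

```python
from datetime import datetime


def log_gap_seconds(start_line: str, end_line: str) -> float:
    """Seconds between two log lines that begin with an HH:MM:SS.ffff stamp.

    Assumption: both lines are from the same day (no midnight rollover).
    """
    fmt = "%H:%M:%S.%f"  # %f accepts 1-6 fractional digits
    t0 = datetime.strptime(start_line.split()[0], fmt)
    t1 = datetime.strptime(end_line.split()[0], fmt)
    return (t1 - t0).total_seconds()


gap = log_gap_seconds(
    "22:23:40.5673 13774 Creating a VM snapshot lastgood",
    "22:28:41.0044 13775 waitpid for 13780 returned 0",
)
print(f"{gap:.4f} s")  # prints "300.4371 s"
```

A gap of just over 300 seconds is consistent with a roughly 5 minute timeout on the `savevm lastgood` command.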

#5

Updated by okurz over 6 years ago

Latest appearance on x86_64: https://openqa.suse.de/tests/1236440

#6

Updated by coolo over 6 years ago

  • Status changed from New to Rejected

we track that as https://bugzilla.suse.com/show_bug.cgi?id=1035453 and https://bugzilla.redhat.com/show_bug.cgi?id=1483765

5 minutes is already far too long; increasing the timeout won't help.

#7

Updated by rpalethorpe over 6 years ago

OK, I have actually just disabled snapshots for LTP (except for saving the disk image after installation).