action #13532

closed

DIE short read for zlre data on 'svirt' on KVM/Xen hosts

Added by michalnowak over 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version: -
Start date: 2016-09-01
Due date:
% Done: 0%
Estimated time:

Description

Originally filed at: https://github.com/os-autoinst/os-autoinst/issues/572.

From time to time os-autoinst fails on KVM and Xen hosts (svirt backend) with:

DIE short read for zlre data 13410 - 994937 at /usr/lib/os-autoinst/consoles/VNC.pm line 932.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 74.
    backend::baseclass::die_handler('short read for zlre data 13410 - 994937 at /usr/lib/os-autoin...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 932
    consoles::VNC::_receive_zlre_encoding('consoles::VNC=HASH(0x5d41f48)', 0, 0, 1024, 767) called at /usr/lib/os-autoinst/consoles/VNC.pm line 860
    consoles::VNC::_receive_update('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 804
    consoles::VNC::_receive_message('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 756
    consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 112
    consoles::vnc_base::current_screen('consoles::vnc_base=HASH(0xe62fb0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 566
    backend::baseclass::capture_screenshot('backend::svirt=HASH(0x47b0b88)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 194
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 165
    backend::baseclass::run_capture_loop('backend::svirt=HASH(0x47b0b88)', undef, 0.2, 0.19) called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 166
    consoles::vnc_base::send_key('consoles::vnc_base=HASH(0xe62fb0)', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 519
    backend::baseclass::bouncer('backend::svirt=HASH(0x47b0b88)', 'send_key', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 524
    backend::baseclass::send_key('backend::svirt=HASH(0x47b0b88)', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 69
    backend::baseclass::handle_command('backend::svirt=HASH(0x47b0b88)', 'HASH(0x629a398)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 421
    backend::baseclass::check_socket('backend::svirt=HASH(0x47b0b88)', 'IO::Handle=GLOB(0x4ee5e50)', 0) called at /usr/lib/os-autoinst/backend/svirt.pm line 96
    backend::svirt::check_socket('backend::svirt=HASH(0x47b0b88)', 'IO::Handle=GLOB(0x4ee5e50)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 203
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 165
    backend::baseclass::run_capture_loop('backend::svirt=HASH(0x47b0b88)', 'IO::Select=ARRAY(0x3b4eb28)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 114
    backend::baseclass::run('backend::svirt=HASH(0x47b0b88)', 6, 17) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x4770980)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 168
    main::init_backend() called at /usr/bin/isotovideo line 222
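
For context, RFC 6143 frames a ZRLE rectangle as a 4-byte big-endian length followed by that many bytes of zlib data. The two numbers in the message above look like "bytes received - bytes expected", but that reading is an assumption based on the wording only. A minimal Python sketch of such a framed read (os-autoinst itself is Perl; `read_exact`, `read_zrle_payload` and `ShortRead` are hypothetical names, not taken from VNC.pm):

```python
import io
import struct
import zlib


class ShortRead(Exception):
    """Raised when the peer delivers fewer bytes than the frame announced."""


def read_exact(stream, n):
    """Read exactly n bytes from a blocking stream, or raise ShortRead."""
    chunks = []
    got = 0
    while got < n:
        chunk = stream.read(n - got)
        if not chunk:  # zero bytes from a blocking read means EOF
            raise ShortRead(f"short read for zrle data {got} - {n}")
        chunks.append(chunk)
        got += len(chunk)
    return b"".join(chunks)


def read_zrle_payload(stream, inflater):
    """Read one ZRLE body: 4-byte big-endian length, then zlib data."""
    (length,) = struct.unpack(">I", read_exact(stream, 4))
    return inflater.decompress(read_exact(stream, length))


# Usage with an in-memory stream standing in for the VNC socket.
packed = zlib.compress(b"\x00" * 64)
frame = struct.pack(">I", len(packed)) + packed
pixels = read_zrle_payload(io.BytesIO(frame), zlib.decompressobj())
```

The inflater is passed in rather than created per rectangle because ZRLE uses one zlib stream for the whole connection. If the server closes the connection in the middle of the payload, `read_exact` raises instead of handing truncated data to the decompressor.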

It usually happens at the same place, e.g. while editing the GRUB command line in JeOS (http://assam.suse.cz/tests/3046/#downloads), three times in a row, and then it disappears. The systems under test are 12 SP1 JeOS and SLES 12 SP2.

The servers run SLES 12 SP2 RC1 with the packages below; I haven't seen it on SLES 12 SP1:
kernel-default-4.4.19-60.1.x86_64
libvirt 2.0.0-21.7.x86_64
qemu-2.6.1-25.8.x86_64


Files

strace.log.ZRLE.xz (1.24 MB), michalnowak, 2017-02-15 13:59

Related issues 1 (0 open, 1 closed)

Blocks openQA Tests - action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware) (Resolved, michalnowak, 2016-01-13)

Actions #1

Updated by michalnowak over 7 years ago

I can reproduce this issue 90 % of the time on openqaw6-kvm.qa.suse.de, which runs SLES 12 SP2 with updates. However, I can't reproduce it on my local machine, which runs Leap 42.2 with updates. That's one difference; the other is remote vs. local host. I tried to mitigate the impact of timing, but to no avail. The behavior clearly differs: SLES presents a 720x400 px resolution in BIOS, while Leap uses 800x600 px.

I suggest migrating that machine, as well as openqaw5-xen.qa.suse.de, to what we use for openQA workers, i.e. Leap, without starting openqa-worker there.

Then the svirt backend would have the same configuration as the qemu backend, since SUTs are started on Leap there as well, if I understand correctly.

Actions #2

Updated by michalnowak over 7 years ago

Actually, it's reproducible on Leap 42.2 as well, even if I update QEMU to 2.8.0 (the latest). I believe it's a QEMU bug, reported as bsc#1016968. Strangely, I am unable to reproduce it with Leap 42.1 JeOS as the guest VM.

Actions #3

Updated by okurz about 7 years ago

IMHO the "virtualization" team should test virtualization from both perspectives: the hypervisor host and the guest. The current "virtualization" job group concentrates more on the hypervisor, while the svirt backend is much better suited for testing the guest as the SUT under the control of different hypervisors. As this ticket refers to the latter and treats the backend as a testing prerequisite, I agree with you that the infrastructure should be more harmonized and that the worker host should have the same installation as the other worker hosts. But aren't the tests running on openqaworker2, which is at least as of today Leap 42.2?

Actions #4

Updated by okurz about 7 years ago

  • Description updated (diff)
Actions #5

Updated by michalnowak about 7 years ago

okurz wrote:

But aren't the tests running on openqaworker2 which is at least as of today Leap 42.2?

I am not talking about the openQA worker, but about the virtualization hosts openqaw5-xen.qa.suse.de and openqaw6-kvm.qa.suse.de, which we use as intermediaries. They are the VM hosts where the VMs of the svirt backend are actually started, i.e. the VMs are not started on the openQA worker itself as they are with the qemu backend.

Actions #6

Updated by okurz about 7 years ago

I see. Can you document the purpose and setup of these hosts on https://wiki.microfocus.net/index.php/OpenQA#Hardware accordingly? Or is it already documented somewhere else? Then please reference that documentation.

Actions #7

Updated by michalnowak about 7 years ago

  • Category deleted (132)

okurz wrote:

I see. Can you document the purpose and setup of these hosts on https://wiki.microfocus.net/index.php/OpenQA#Hardware accordingly? Or is it already documented somewhere else? Then please reference that documentation.

Added some information to https://wiki.microfocus.net/index.php/OpenQA#Hardware, hopefully it's sufficient.

Actions #8

Updated by okurz about 7 years ago

  • Category set to 168

Good. But the question remains: how are the machines administered? AFAICS https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/top.sls will not touch these hosts because no rule matches them. IMHO they should be administered with salt in the same way as the others.

Actions #9

Updated by coolo about 7 years ago

I don't really know what we can do here. The socket is blocking, so a read of 0 bytes means EOF. All we can try is reconnecting.
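
The reconnect idea could look roughly like the following Python sketch. It is an illustration only, not the os-autoinst implementation: `connect` and `read_frame` are hypothetical callables supplied by the caller, and `read_frame` is assumed to raise `EOFError` when a blocking read returns zero bytes.

```python
import time


def read_frame_or_reconnect(conn, connect, read_frame, attempts=3, delay=0.0):
    """Return (conn, frame); reconnect and retry when read_frame hits EOF."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return conn, read_frame(conn)
        except EOFError:
            conn.close()  # the peer is gone, this socket is useless now
            if attempt + 1 == attempts:
                raise  # out of retries, let the caller see the EOF
            time.sleep(delay)
            conn = connect()  # hypothetical: dial the VNC server again
```

The function returns the connection together with the frame because the caller must keep using the new connection after a reconnect. A real VNC client would also have to repeat the RFB handshake and request a full framebuffer update after reconnecting, since the server keeps no state from the old session.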

Actions #10

Updated by michalnowak about 7 years ago

coolo wrote:

I don't really know what we can do here. The socket is blocking, so a read of 0 bytes means EOF. All we can try is reconnecting.

Yes, that would mitigate the bug in Qemu (https://bugzilla.suse.com/show_bug.cgi?id=1016968).

Actions #11

Updated by michalnowak about 7 years ago

  • Status changed from New to Resolved

With de19bdcc68392abd37adbfe151c0506d02f3525a, ZRLE is no longer used by default, so the problem on svirt is gone as well.

The next problem in line is the frequent 'considering VNC stalled - turning black'.

Actions #12

Updated by michalnowak about 7 years ago

  • Status changed from Resolved to New

The ZRLE feature removal was reverted. The problem is back: https://openqa.suse.de/tests/764295/file/autoinst-log.txt.

I guess I need to disable it explicitly on svirt, as was done for Dell: https://github.com/os-autoinst/os-autoinst/pull/719/commits/99aface7af9cf75f882ac30644f15c26324a37bc.

Actions #13

Updated by coolo about 7 years ago

Be careful though: on zKVM, raw is broken and zlre works :(
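
Since some setups need ZRLE and others need it off, the choice has to be made per connection. In the RFB protocol (RFC 6143) the client announces which encodings it accepts with a SetEncodings message, where Raw is encoding 0 and ZRLE is encoding 16. A minimal Python sketch of building that message; the `allow_zrle` flag is a hypothetical name and not the actual os-autoinst option:

```python
import struct

RAW, ZRLE = 0, 16  # encoding numbers from RFC 6143


def build_set_encodings(allow_zrle):
    """Build an RFB SetEncodings message, optionally advertising ZRLE."""
    encodings = ([ZRLE] if allow_zrle else []) + [RAW]
    # message-type 2, one padding byte, U16 count, then one S32 per encoding
    header = struct.pack(">BxH", 2, len(encodings))
    return header + struct.pack(f">{len(encodings)}i", *encodings)
```

Encodings are listed in order of preference, so ZRLE comes first when it is allowed and Raw stays as the fallback. With `allow_zrle=False` the server is only offered Raw.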

Actions #14

Updated by michalnowak about 7 years ago

While gathering logs for ticket #16418, I noticed how easy it is to get hit by this issue (95 % of the time). Attaching an strace log; perhaps there's something useful in it.

Actions #15

Updated by michalnowak about 7 years ago

  • Blocks action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware) added
Actions #16

Updated by RBrownSUSE about 7 years ago

michalnowak wrote:

I can reproduce this issue 90 % of the time on openqaw6-kvm.qa.suse.de, which runs SLES 12 SP2 with updates. However, I can't reproduce it on my local machine, which runs Leap 42.2 with updates. That's one difference; the other is remote vs. local host. I tried to mitigate the impact of timing, but to no avail. The behavior clearly differs: SLES presents a 720x400 px resolution in BIOS, while Leap uses 800x600 px.

I suggest migrating that machine, as well as openqaw5-xen.qa.suse.de, to what we use for openQA workers, i.e. Leap, without starting openqa-worker there.

Then the svirt backend would have the same configuration as the qemu backend, since SUTs are started on Leap there as well, if I understand correctly.

If the bug is with SLES 12 SP2, what is the bug ID for the SLES 12 SP2 bug?

I am okay with considering the migration to Leap, but that is a MAJOR request with huge impact. We should at least consider, you know, fixing the issue in SLE 12 SP2, as people who pay us actually use it.

Actions #17

Updated by michalnowak about 7 years ago

  • Priority changed from Normal to High

RBrownSUSE wrote:

If the bug is with SLES 12 SP2, what is the bug ID for the SLES 12 SP2 bug?

See comment #10.

I am okay with considering the migration to Leap, but that is a MAJOR request with huge impact. We should at least consider, you know, fixing the issue in SLE 12 SP2, as people who pay us actually use it.

I don't actually need to move to Leap; it wouldn't help. SLES 12 SP1 would work around it, if I am correct in https://bugzilla.suse.com/show_bug.cgi?id=1016968.

Actions #18

Updated by maritawerner about 7 years ago

  • Priority changed from High to Normal

I have set the bug to P1 now, and I hope to get an answer from the bug owner soon.

Actions #19

Updated by maritawerner about 7 years ago

  • Priority changed from Normal to High
Actions #21

Updated by michalnowak almost 7 years ago

I disabled ZRLE on svirt, but that did not help either (not to mention other problems); the failure is just triggered elsewhere:

DIE unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 918.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
    backend::baseclass::die_handler('unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.p...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 918
    consoles::VNC::_receive_update('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 880
    consoles::VNC::_receive_message('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 782
    consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 68
    consoles::vnc_base::request_screen_update('consoles::vnc_base=HASH(0x3b29350)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 521
    backend::baseclass::bouncer('backend::svirt=HASH(0x4830760)', 'request_screen_update', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 504
    backend::baseclass::request_screen_update('backend::svirt=HASH(0x4830760)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 167
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 151
    backend::baseclass::run_capture_loop('backend::svirt=HASH(0x4830760)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 122
    backend::baseclass::run('backend::svirt=HASH(0x4830760)', 6, 9) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x48af0f8)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 198
    main::init_backend() called at /usr/bin/isotovideo line 263

http://assam.suse.cz/tests/5560/file/autoinst-log.txt

Actions #22

Updated by coolo almost 7 years ago

  • Assignee set to coolo

What is the easiest way to reproduce this?

Actions #23

Updated by michalnowak almost 7 years ago

coolo wrote:

What is the easiest way to reproduce this?

Clone any JeOS job on Xen HVM, e.g. https://openqa.suse.de/tests/858635, make sure all VIRSH_* variables are set, and set VIRSH_INSTANCE to a number higher than 1 so as not to collide with OSD jobs. On the worker, the file "HDD_1" has to exist, e.g. as a non-zero-sized file, but the image actually used is downloaded to the VM host automatically.

It should fail in GRUB, in the bootloader_uefi, redefine_svirt_domain or console_reboot test module.

Actions #24

Updated by coolo almost 7 years ago

  • Status changed from New to Feedback
Actions #25

Updated by coolo almost 7 years ago

  • Status changed from Feedback to Resolved

tested and merged
