action #13532
closed
DIE short read for zlre data on 'svirt' on KVM/Xen hosts
Description
Originally filed at: https://github.com/os-autoinst/os-autoinst/issues/572.
From time to time os-autoinst fails on KVM and Xen hosts (svirt backend) with:
DIE short read for zlre data 13410 - 994937 at /usr/lib/os-autoinst/consoles/VNC.pm line 932.
at /usr/lib/os-autoinst/backend/baseclass.pm line 74.
backend::baseclass::die_handler('short read for zlre data 13410 - 994937 at /usr/lib/os-autoin...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 932
consoles::VNC::_receive_zlre_encoding('consoles::VNC=HASH(0x5d41f48)', 0, 0, 1024, 767) called at /usr/lib/os-autoinst/consoles/VNC.pm line 860
consoles::VNC::_receive_update('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 804
consoles::VNC::_receive_message('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 756
consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x5d41f48)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 112
consoles::vnc_base::current_screen('consoles::vnc_base=HASH(0xe62fb0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 566
backend::baseclass::capture_screenshot('backend::svirt=HASH(0x47b0b88)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 194
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 165
backend::baseclass::run_capture_loop('backend::svirt=HASH(0x47b0b88)', undef, 0.2, 0.19) called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 166
consoles::vnc_base::send_key('consoles::vnc_base=HASH(0xe62fb0)', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 519
backend::baseclass::bouncer('backend::svirt=HASH(0x47b0b88)', 'send_key', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 524
backend::baseclass::send_key('backend::svirt=HASH(0x47b0b88)', 'HASH(0x62a69d8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 69
backend::baseclass::handle_command('backend::svirt=HASH(0x47b0b88)', 'HASH(0x629a398)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 421
backend::baseclass::check_socket('backend::svirt=HASH(0x47b0b88)', 'IO::Handle=GLOB(0x4ee5e50)', 0) called at /usr/lib/os-autoinst/backend/svirt.pm line 96
backend::svirt::check_socket('backend::svirt=HASH(0x47b0b88)', 'IO::Handle=GLOB(0x4ee5e50)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 203
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 165
backend::baseclass::run_capture_loop('backend::svirt=HASH(0x47b0b88)', 'IO::Select=ARRAY(0x3b4eb28)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 114
backend::baseclass::run('backend::svirt=HASH(0x47b0b88)', 6, 17) called at /usr/lib/os-autoinst/backend/driver.pm line 85
backend::driver::start('backend::driver=HASH(0x4770980)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 168
main::init_backend() called at /usr/bin/isotovideo line 222
It usually happens at the same place, e.g. while editing the GRUB command line in JeOS (http://assam.suse.cz/tests/3046/#downloads), three times in a row, and then it disappears. The systems under test are SLES 12 SP1 JeOS and SLES 12 SP2.
The servers run SLES 12 SP2 RC1; I haven't seen it on SLES 12 SP1:
kernel-default-4.4.19-60.1.x86_64
libvirt 2.0.0-21.7.x86_64
qemu-2.6.1-25.8.x86_64
Files
Updated by michalnowak about 8 years ago
I can reproduce this issue about 90% of the time on openqaw6-kvm.qa.suse.de, which runs SLES 12 SP2 with updates. However, I can't reproduce it on my local machine, which runs Leap 42.2 with updates. That's one difference; the other is remote vs. local host -- I tried to mitigate the impact of timing, but to no avail. Clearly the behavior differs, as SLES presents a 720x400 px resolution in BIOS, but Leap uses 800x600 px.
I suggest migrating the machine, as well as openqaw5-xen.qa.suse.de, to what we use for openQA workers, i.e. Leap, without the openqa-worker started there.
Then the svirt backend would have the same configuration as the qemu backend, where SUTs are started on Leap as well, if I understand it correctly.
Updated by michalnowak about 8 years ago
Actually, it's reproducible on Leap 42.2 as well, even if I update Qemu to 2.8.0 (the latest). I believe it's a Qemu bug, reported as bsc#1016968. Strangely, I am unable to reproduce it with Leap 42.1 JeOS as a guest VM.
Updated by okurz about 8 years ago
IMHO the "virtualization" team should test virtualization from the perspective of both the hypervisor host and the guest. The current "virtualization" job group concentrates more on the hypervisor, whereas the svirt backend is much more suitable for testing the guest as SUT under the control of different hypervisors. As this ticket refers to the latter while taking the backend as a testing prerequisite, I agree with you that the infrastructure should be more harmonized and the worker host should have the same installation as the other worker hosts. But aren't the tests running on openqaworker2 which is at least as of today Leap 42.2?
Updated by michalnowak about 8 years ago
okurz wrote:
But aren't the tests running on openqaworker2 which is at least as of today Leap 42.2?
I'm not talking about the openQA worker, but about the virtualization hosts openqaw5-xen.qa.suse.de and openqaw6-kvm.qa.suse.de, which we use as intermediaries. They are the VM hosts where VMs are actually started via the svirt backend, i.e. the VMs are not started on the openQA worker itself as they are with the qemu backend.
Updated by okurz about 8 years ago
I see. Can you document the purpose and setup of these hosts on https://wiki.microfocus.net/index.php/OpenQA#Hardware accordingly? Or is it documented already somewhere else? Then please reference the corresponding documentation.
Updated by michalnowak about 8 years ago
- Category deleted (132)
okurz wrote:
I see. Can you document the purpose and setup of these hosts on https://wiki.microfocus.net/index.php/OpenQA#Hardware accordingly? Or is it documented already somewhere else? Then please reference the corresponding documentation.
Added some information to https://wiki.microfocus.net/index.php/OpenQA#Hardware, hopefully it's sufficient.
Updated by okurz about 8 years ago
- Category set to 168
Good. But the question remains: how are the machines administered? AFAICS https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/top.sls will not touch these hosts because no rule matches them. IMHO they should be administered with salt in the same way as the others.
Updated by coolo about 8 years ago
I don't really know what we can do here. The socket is blocking, so a read of 0 means EOF. All we can try is reconnecting
Updated by michalnowak about 8 years ago
coolo wrote:
I don't really know what we can do here. The socket is blocking, so a read of 0 means EOF. All we can try is reconnecting
Yes, that would mitigate the bug in Qemu (https://bugzilla.suse.com/show_bug.cgi?id=1016968).
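To illustrate the point about the blocking socket: the sketch below is not the os-autoinst code (which is Perl), only a minimal Python illustration of the idea. It assumes the "13410 - 994937" in the error reads as "bytes received - bytes expected" for a length-prefixed payload. The names `read_exact`, `read_length_prefixed` and `read_with_reconnect` are made up for this example.

```python
import socket
import struct


class ShortRead(Exception):
    """Raised when the peer closes the connection before a full payload arrives."""


def read_exact(sock, length):
    """Read exactly `length` bytes from a blocking socket or raise ShortRead."""
    chunks = []
    received = 0
    while received < length:
        chunk = sock.recv(length - received)
        if not chunk:  # an empty result on a blocking socket means EOF
            raise ShortRead(f"short read: {received} - {length}")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)


def read_length_prefixed(sock):
    """Read a 4-byte big-endian length followed by that many payload bytes."""
    (length,) = struct.unpack("!I", read_exact(sock, 4))
    return read_exact(sock, length)


def read_with_reconnect(connect, attempts=3):
    """Get a fresh socket from `connect()` and retry when the peer hangs up early."""
    last_error = None
    for _ in range(attempts):
        sock = connect()
        try:
            return read_length_prefixed(sock)
        except ShortRead as error:
            last_error = error
        finally:
            sock.close()
    raise last_error
```

A real VNC client would also have to redo the handshake and request a full framebuffer update after reconnecting; the sketch leaves that to the hypothetical `connect` callback.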
Updated by michalnowak almost 8 years ago
- Status changed from New to Resolved
With de19bdcc68392abd37adbfe151c0506d02f3525a ZRLE is gone by default, and so is the problem on svirt.
The next most frequent problem is 'considering VNC stalled - turning black'.
Updated by michalnowak almost 8 years ago
- Status changed from Resolved to New
The ZRLE feature removal was reverted. The problem is back: https://openqa.suse.de/tests/764295/file/autoinst-log.txt.
I guess I need to disable it explicitly on svirt, as was done for Dell: https://github.com/os-autoinst/os-autoinst/pull/719/commits/99aface7af9cf75f882ac30644f15c26324a37bc.
Updated by coolo almost 8 years ago
Be careful though, as for zKVM raw is broken and ZRLE works :(
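For reference, which encodings a VNC client accepts is announced in the RFB SetEncodings message, so "disabling ZRLE" comes down to leaving encoding 16 out of that list. The following is a minimal Python sketch of that message, not the os-autoinst implementation. The `allow_zrle` switch is a hypothetical per-backend flag introduced only for this illustration; given the zKVM caveat, it would have to stay enabled there.

```python
import struct

# Encoding numbers as defined by the RFB protocol (RFC 6143).
RAW = 0
ZRLE = 16


def choose_encodings(allow_zrle):
    """Return the encodings to announce, most preferred first."""
    return [ZRLE, RAW] if allow_zrle else [RAW]


def set_encodings_message(encodings):
    """Build an RFB SetEncodings client message for the given encodings."""
    # message-type 2, one padding byte, U16 count, then one S32 per encoding
    header = struct.pack("!BxH", 2, len(encodings))
    return header + b"".join(struct.pack("!i", e) for e in encodings)
```

With `allow_zrle=False` the server is left with raw encoding only, which avoids the zlib-compressed payload but sends far more data per update.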
Updated by michalnowak almost 8 years ago
- File strace.log.ZRLE.xz added
While gathering logs for ticket #16418 I noticed how easy it is to hit this issue (about 95% of the time). Attaching an strace log; perhaps there's something useful in it.
Updated by michalnowak almost 8 years ago
- Blocks action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware) added
Updated by RBrownSUSE almost 8 years ago
michalnowak wrote:
I can reproduce this issue about 90% of the time on openqaw6-kvm.qa.suse.de, which runs SLES 12 SP2 with updates. However, I can't reproduce it on my local machine, which runs Leap 42.2 with updates. That's one difference; the other is remote vs. local host -- I tried to mitigate the impact of timing, but to no avail. Clearly the behavior differs, as SLES presents a 720x400 px resolution in BIOS, but Leap uses 800x600 px.
I suggest migrating the machine, as well as openqaw5-xen.qa.suse.de, to what we use for openQA workers, i.e. Leap, without the openqa-worker started there.
Then the svirt backend would have the same configuration as the qemu backend, where SUTs are started on Leap as well, if I understand it correctly.
If the bug is with SLES 12 SP2, what is the bug ID for the SLES 12 SP2 bug?
I am okay with considering the migration to Leap, but that is a MAJOR request with huge impacts. We should at least consider, you know, fixing the issue in SLE 12 SP2, as people who pay us actually use it.
Updated by michalnowak almost 8 years ago
- Priority changed from Normal to High
RBrownSUSE wrote:
If the bug is with SLES 12 SP2, what is the bug ID for the SLES 12 SP2 bug?
See comment #10.
I am okay considering the migration to Leap but that is a MAJOR request with huge impacts, we should at least consider you know, fixing the issue in SLE 12 SP2 as people who pay us actually use it..
I don't actually need to move to Leap; it wouldn't help. SLES 12 SP1 would work around it, if I am correct in https://bugzilla.suse.com/show_bug.cgi?id=1016968.
Updated by maritawerner almost 8 years ago
- Priority changed from High to Normal
I have set the Bug to P1 now and I hope I will get an answer from the bugowner soonish.
Updated by maritawerner almost 8 years ago
- Priority changed from Normal to High
Updated by okurz almost 8 years ago
Updated by michalnowak almost 8 years ago
I disabled ZRLE on svirt, but that did not help either (not to mention other problems); the failure is just triggered elsewhere:
DIE unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 918.
at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
backend::baseclass::die_handler('unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.p...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 918
consoles::VNC::_receive_update('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 880
consoles::VNC::_receive_message('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 782
consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x621cdf0)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 68
consoles::vnc_base::request_screen_update('consoles::vnc_base=HASH(0x3b29350)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 521
backend::baseclass::bouncer('backend::svirt=HASH(0x4830760)', 'request_screen_update', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 504
backend::baseclass::request_screen_update('backend::svirt=HASH(0x4830760)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 167
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 151
backend::baseclass::run_capture_loop('backend::svirt=HASH(0x4830760)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 122
backend::baseclass::run('backend::svirt=HASH(0x4830760)', 6, 9) called at /usr/lib/os-autoinst/backend/driver.pm line 85
backend::driver::start('backend::driver=HASH(0x48af0f8)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 198
main::init_backend() called at /usr/bin/isotovideo line 263
Updated by coolo almost 8 years ago
- Assignee set to coolo
What would you recommend as the easiest way to reproduce this?
Updated by michalnowak almost 8 years ago
coolo wrote:
What would you recommend as the easiest way to reproduce this?
Clone any JeOS on Xen HVM job, e.g. https://openqa.suse.de/tests/858635, make sure all VIRSH_* variables are set, and set VIRSH_INSTANCE to a number higher than 1 so as not to collide with OSD jobs. On the worker the file "HDD_1" has to exist, e.g. as a non-zero sized file, but the image actually used is downloaded on the VM host automatically.
It should fail in GRUB, either in the bootloader_uefi, redefine_svirt_domain or console_reboot test module.
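The preconditions above can be checked before cloning with something like the following Python sketch. It is only an illustration under assumptions: `check_repro_settings` is a made-up helper, the list of required VIRSH_* variable names has to be supplied by the caller (the ticket does not enumerate them), and `hdd_path` stands for wherever the file referred to as "HDD_1" lives on the worker.

```python
import os


def check_repro_settings(settings, required_virsh_keys, hdd_path):
    """Return a list of problems with the job settings; empty means ready."""
    problems = []
    # Every VIRSH_* variable the caller considers required must be non-empty.
    for key in required_virsh_keys:
        if not settings.get(key):
            problems.append(f"{key} is not set")
    # VIRSH_INSTANCE must be higher than 1 to stay clear of OSD jobs.
    try:
        instance = int(settings.get("VIRSH_INSTANCE", ""))
    except ValueError:
        instance = None
    if instance is None or instance <= 1:
        problems.append("VIRSH_INSTANCE must be a number higher than 1")
    # The HDD_1 file only has to exist on the worker, e.g. as a non-empty file.
    if not os.path.isfile(hdd_path) or os.path.getsize(hdd_path) == 0:
        problems.append(f"{hdd_path} must exist as a non-empty file")
    return problems
```

An empty result only means the listed preconditions hold; it says nothing about whether the failure will actually trigger.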
Updated by coolo almost 8 years ago
- Status changed from New to Feedback