action #16418
closed[tools][sprint 201712.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes
100%
Description
This happens nearly always on svirt backend when openQA checks (assert_shutdown
) for VM being alive or gone:
02:00:54.5751 Debug: /var/lib/openqa/share/tests/sle/tests/x11/shutdown.pm:122 called testapi::assert_shutdown
02:00:54.5752 13154 <<< testapi::assert_shutdown(timeout=60)
02:00:54.6697 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:54.7792 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:55.8770 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:55.9934 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:57.0986 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:57.2200 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:58.3218 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:58.4442 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:59.5340 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:59.6420 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:00.7386 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:00.8455 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:01.9331 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:02.0630 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:03.1563 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:03.2704 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:04.3779 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:04.4983 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:09.5313 13155 considering VNC stalled - turning black
03:12:11.0452 13147 signalhandler got TERM - loop 1
https://openqa.suse.de/tests/754073/file/autoinst-log.txt
The call where it stops inside assert_shutdown
is query_isotovideo
's read_json
. Not sure if read_json
comes from query_isotovideo
or is some asynchronous one.
Files
Updated by coolo almost 8 years ago
please run strace -o strace.log -f -tt -s 200 isotovideo -d on the job - and provide it to me
Updated by michalnowak almost 8 years ago
- File strace.log strace.log added
sudo strace -o strace.log -f -tt -s 200 -d -p 27877
PID 27877 is perl /usr/bin/isotovideo -d
. isotovideo
finished while openQA job still looks as if it was running.
Updated by coolo almost 8 years ago
(btw, attaching is not necessarly the best approach. the file will be pretty big. That's why I said, 'provide it')
Updated by michalnowak almost 8 years ago
Join the downloaded files like this: cat xa* > strace.log.xz
(MD5 sum 46d41fec5fb6a3612dd7ed1437e159a4
of the concatenated file) and unxz
it. Log is about 360 MB.
Updated by okurz almost 8 years ago
- Category set to 132
- Target version set to Milestone 6
Updated by michalnowak almost 8 years ago
- Blocks action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware) added
Updated by coolo almost 8 years ago
I have a rough idea what is going on, but I need a test system to svirt against. What is the setup?
Updated by okurz over 7 years ago
- Target version changed from Milestone 6 to Milestone 7
How is that related to https://github.com/os-autoinst/os-autoinst/pull/764 ?
Updated by okurz over 7 years ago
- Related to action #18768: [s390x][zkvm]test fails in or after shutdown with "DIE Can't locate object method "blessed" via package "socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 879." added
Updated by coolo over 7 years ago
- Related to deleted (action #18768: [s390x][zkvm]test fails in or after shutdown with "DIE Can't locate object method "blessed" via package "socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 879.")
Updated by coolo over 7 years ago
- Assignee deleted (
coolo)
it won't reach that point atm - I get
14:25:59.5880 679 MATCH(shutdown-gnome-20140605:0.00)
14:25:59.6714 679 MATCH(shutdown-gnome-20160713:0.00)
14:25:59.8333 677 >>> testapi::_check_backend_response: match=logoutdialog timed out after 15
Updated by michalnowak over 7 years ago
coolo wrote:
Is there any test that fails with that atm?
What about this one https://openqa.suse.de/tests/927099/file/autoinst-log.txt ? System ought to check with ! virsh dominfo openQA-SUT-1 | grep -w 'shut off'
if the VM is down already, but no other virsh dominfo...
check gets finished, when the VM is down.
Updated by michalnowak over 7 years ago
- Status changed from New to Resolved
michalnowak wrote:
What about this one https://openqa.suse.de/tests/927099/file/autoinst-log.txt ? System ought to check with
! virsh dominfo openQA-SUT-1 | grep -w 'shut off'
if the VM is down already, but no othervirsh dominfo...
check gets finished, when the VM is down.
No, that's something unrelated. If the blessed
method was actually located (fixed now), it would die with 'socket does not exist...', which is to be expected if the system is down but console('sut')->disable_vnc_stalls
was not used before shutdown.
Closing as I was unable to trigger it with disable_vnc_stalls
in place. Thanks @coolo.
Updated by michalnowak over 7 years ago
- File 5906-autoinst-log.txt 5906-autoinst-log.txt added
- Status changed from Resolved to New
While working on https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2920 I noticed a hung (see http://assam.suse.cz/tests/5906/file/autoinst-log.txt attached as well):
14:04:20.0091 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.0092 29642 <<< testapi::console(testapi_console='sut')
14:04:20.0093 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.0094 29642 <<< backend::console_proxy::__ANON__(wrapped_call={
'console' => 'sut',
'function' => 'disable_vnc_stalls',
'args' => []
})
14:04:20.1951 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.1953 29642 <<< testapi::assert_shutdown(timeout=60)
14:04:20.5544 29644 Connection to root@openqaw5-xen.qa.suse.de established
14:04:20.7948 29644 Command executed: ! virsh dominfo openQA-SUT-20 | grep -w 'shut off', ret=0
14:04:22.1732 29644 Connection to root@openqaw5-xen.qa.suse.de established
14:04:22.4370 29644 Command executed: ! virsh dominfo openQA-SUT-20 | grep -w 'shut off', ret=0
[ 518.454415] reboot: Restarting system
15:53:11.3939 29639 signalhandler got TERM - loop 1
15:53:11.3940 29639 awaiting death of commands process
15:53:11.3958 29639 commands process exited: 29641
It happened in 1:20 cases.
Updated by michalnowak over 7 years ago
This ticket prevents adding Xen test to SLES:Functional as they fail fairly often on restart or shutdown: https://openqa.suse.de/tests/1002335.
Updated by maritawerner over 7 years ago
- Target version changed from Milestone 7 to Milestone 9
What is the status here? I move that to milestone 9.
Updated by okurz about 7 years ago
- Related to action #28216: [svirt-xen-hvm]test fails with 2h-timeout in "install_and_reboot" trying to restart the machine - xen backend specific restart command failed? added
Updated by okurz about 7 years ago
- Has duplicate action #28331: [sle][functional][svirt-xen-hvm] test fails in shutdown - 2h timeout added
Updated by okurz about 7 years ago
workaround was done by mnowak in the tests: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3980
Updated by okurz about 7 years ago
- Subject changed from svirt: openQA won't perform another query_isotovideo check when VM is down to svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes
- Status changed from New to In Progress
- Priority changed from High to Urgent
Does not seem like the workaround helped: https://openqa.suse.de/tests/1293502/file/autoinst-log.txt
Updated by michalnowak about 7 years ago
David looked into this yesterday, but not sure if he reported his findings, so just FTR.
Clone a JeOS Xen job to you instance:
/usr/share/openqa/script/clone_job.pl --skip-download --from https://openqa.suse.de --host <YOU> 1315294 BACKEND=svirt VIRSH_HOSTNAME=openqaw5-xen.qa.suse.de VIRSH_PASSWORD='nots3cr3t' VIRSH_VMM_FAMILY=xen VIRSH_VMM_TYPE=hvm VIRSH_GUEST=openqaw5-xen.qa.suse.de WORKER_HOSTNAME=<YOU> QEMUCPUS=1 QEMURAM=1024 VIRSH_GUEST_PASSWORD=linux VIRSH_INSTANCE=20
Change revive_xen_domain
test module's run()
like this so it restarts indefinitely and the problem is triggered more easily, and let it run, should hang after ~10th iteration:
sub run {
my ($self) = @_;
my $tries = 100;
do {
print "$tries restarts to go\n";
power_action('reboot');
# If we connect to 'sut' VNC display "too early" the VNC server won't be
# ready we will be left with a blank screen.
sleep 2 if check_var('VIRSH_VMM_FAMILY', 'vmware');
if (check_var('VIRSH_VMM_TYPE', 'hvm')) {
$self->wait_boot;
}
elsif (check_var('VIRSH_VMM_TYPE', 'linux')) {
my $timeout = 80;
wait_serial('Welcome to SUSE Linux', $timeout) || die "System did not boot in $timeout seconds.";
}
select_console 'root-console';
}while($tries--);
# if it didn't died in $tries tries than we're doing it wrong to replicate
}
Updated by coolo about 7 years ago
if you strace the backend, it shows the typical symtoms of a dead ssh channel. Do your tests ssh into the SUT for serial grab? And if so, do you close that tunnel before shutting down?
Updated by michalnowak about 7 years ago
- Assignee set to michalnowak
Updated by dasantiago about 7 years ago
I implemented a polling mechanism to identify when such is happening and to kill the job.
Updated by dasantiago about 7 years ago
- % Done changed from 0 to 100
This can be closed.
Quick dying and informing mechanism implemented and fix implemented in the test
https://github.com/os-autoinst/os-autoinst/pull/902
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4107
https://github.com/os-autoinst/os-autoinst/pull/899
Updated by dasantiago about 7 years ago
- Status changed from In Progress to Resolved
Updated by szarate about 7 years ago
- Subject changed from svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes to [tools][Sprint 201711.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes
Updated by szarate about 7 years ago
- Subject changed from [tools][Sprint 201711.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes to [tools][sprint 201712.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes
Updated by szarate almost 7 years ago
- Target version changed from Current Sprint to Done