Project

General

Profile

action #16418

[tools][sprint 201712.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes

Added by michalnowak over 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
michalnowak
Category:
Feature requests
Target version:
Start date:
2017-02-02
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

This happens nearly always on svirt backend when openQA checks (assert_shutdown) for VM being alive or gone:

02:00:54.5751 Debug: /var/lib/openqa/share/tests/sle/tests/x11/shutdown.pm:122 called testapi::assert_shutdown
02:00:54.5752 13154 <<< testapi::assert_shutdown(timeout=60)
02:00:54.6697 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:54.7792 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:55.8770 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:55.9934 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:57.0986 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:57.2200 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:58.3218 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:58.4442 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:00:59.5340 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:00:59.6420 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:00.7386 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:00.8455 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:01.9331 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:02.0630 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:03.1563 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:03.2704 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0
02:01:04.3779 13155 Connection to root@openqaw6-kvm.qa.suse.de established
02:01:04.4983 13155 Command executed: ! virsh dominfo openQA-SUT-1 | grep -w 'shut off', ret=0

02:01:09.5313 13155 considering VNC stalled - turning black
03:12:11.0452 13147 signalhandler got TERM - loop 1

https://openqa.suse.de/tests/754073/file/autoinst-log.txt

The call where it stops inside assert_shutdown is query_isotovideo's read_json. Not sure if read_json comes from query_isotovideo or is some asynchronous one.

strace.log (716 KB) strace.log michalnowak, 2017-02-06 14:18
xaa (9.54 MB) xaa 1st part of strace.log.xz michalnowak, 2017-02-15 13:41
xab (7.86 MB) xab 2nd part of strace.log.xz michalnowak, 2017-02-15 13:41
5906-autoinst-log.txt (226 KB) 5906-autoinst-log.txt michalnowak, 2017-05-19 10:00

Related issues

Related to openQA Tests - action #28216: [svirt-xen-hvm]test fails with 2h-timeout in "install_and_reboot" trying to restart the machine - xen backend specific restart command failed?Resolved2017-11-22

Has duplicate openQA Tests - action #28331: [sle][functional][svirt-xen-hvm] test fails in shutdown - 2h timeoutRejected2017-11-24

Blocks openQA Tests - action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware)Resolved2016-01-13

History

#1 Updated by coolo over 3 years ago

please run strace -o strace.log -f -tt -s 200 isotovideo -d on the job - and provide it to me

#2 Updated by michalnowak over 3 years ago

sudo strace -o strace.log -f -tt -s 200 -d -p 27877

PID 27877 is perl /usr/bin/isotovideo -d. isotovideo finished while openQA job still looks as if it was running.

#3 Updated by coolo over 3 years ago

ok, nice. Now back to my query from #1

#4 Updated by coolo over 3 years ago

(btw, attaching is not necessarly the best approach. the file will be pretty big. That's why I said, 'provide it')

#5 Updated by michalnowak over 3 years ago

Join the downloaded files like this: cat xa* > strace.log.xz (MD5 sum 46d41fec5fb6a3612dd7ed1437e159a4 of the concatenated file) and unxz it. Log is about 360 MB.

#6 Updated by okurz over 3 years ago

  • Category set to 132
  • Target version set to Milestone 6

#7 Updated by michalnowak over 3 years ago

  • Blocks action #10206: [tools]libvirt tests (Xen, Hyper-V, VMware) added

#8 Updated by michalnowak over 3 years ago

  • Priority changed from Normal to High

#9 Updated by coolo over 3 years ago

I have a rough idea what is going on, but I need a test system to svirt against. What is the setup?

#10 Updated by okurz over 3 years ago

  • Target version changed from Milestone 6 to Milestone 7

#11 Updated by okurz over 3 years ago

  • Related to action #18768: [s390x][zkvm]test fails in or after shutdown with "DIE Can't locate object method "blessed" via package "socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 879." added

#12 Updated by coolo over 3 years ago

  • Related to deleted (action #18768: [s390x][zkvm]test fails in or after shutdown with "DIE Can't locate object method "blessed" via package "socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 879.")

#13 Updated by coolo over 3 years ago

  • Assignee set to coolo

#14 Updated by coolo over 3 years ago

  • Assignee deleted (coolo)

it won't reach that point atm - I get

14:25:59.5880 679 MATCH(shutdown-gnome-20140605:0.00)
14:25:59.6714 679 MATCH(shutdown-gnome-20160713:0.00)
14:25:59.8333 677 >>> testapi::_check_backend_response: match=logoutdialog timed out after 15

#15 Updated by coolo over 3 years ago

Is there any test that fails with that atm?

#16 Updated by michalnowak over 3 years ago

coolo wrote:

Is there any test that fails with that atm?

What about this one https://openqa.suse.de/tests/927099/file/autoinst-log.txt ? System ought to check with ! virsh dominfo openQA-SUT-1 | grep -w 'shut off' if the VM is down already, but no other virsh dominfo... check gets finished, when the VM is down.

#17 Updated by michalnowak over 3 years ago

  • Status changed from New to Resolved

michalnowak wrote:

What about this one https://openqa.suse.de/tests/927099/file/autoinst-log.txt ? System ought to check with ! virsh dominfo openQA-SUT-1 | grep -w 'shut off' if the VM is down already, but no other virsh dominfo... check gets finished, when the VM is down.

No, that's something unrelated. If the blessed method was actually located (fixed now), it would die with 'socket does not exist...', which is to be expected if the system is down but console('sut')->disable_vnc_stalls was not used before shutdown.

Closing as I was unable to trigger it with disable_vnc_stalls in place. Thanks @coolo.

#18 Updated by michalnowak over 3 years ago

While working on https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2920 I noticed a hung (see http://assam.suse.cz/tests/5906/file/autoinst-log.txt attached as well):

14:04:20.0091 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.0092 29642 <<< testapi::console(testapi_console='sut')
14:04:20.0093 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.0094 29642 <<< backend::console_proxy::__ANON__(wrapped_call={
'console' => 'sut',
'function' => 'disable_vnc_stalls',
'args' => []
})
14:04:20.1951 Debug: /var/lib/openqa/share/tests/sle/tests/installation/install_and_reboot.pm:150 called utils::xen_restore_system
14:04:20.1953 29642 <<< testapi::assert_shutdown(timeout=60)
14:04:20.5544 29644 Connection to root@openqaw5-xen.qa.suse.de established
14:04:20.7948 29644 Command executed: ! virsh dominfo openQA-SUT-20 | grep -w 'shut off', ret=0
14:04:22.1732 29644 Connection to root@openqaw5-xen.qa.suse.de established
14:04:22.4370 29644 Command executed: ! virsh dominfo openQA-SUT-20 | grep -w 'shut off', ret=0
[  518.454415] reboot: Restarting system
15:53:11.3939 29639 signalhandler got TERM - loop 1
15:53:11.3940 29639 awaiting death of commands process
15:53:11.3958 29639 commands process exited: 29641

It happened in 1:20 cases.

#19 Updated by michalnowak over 3 years ago

This ticket prevents adding Xen test to SLES:Functional as they fail fairly often on restart or shutdown: https://openqa.suse.de/tests/1002335.

#22 Updated by maritawerner about 3 years ago

  • Target version changed from Milestone 7 to Milestone 9

What is the status here? I move that to milestone 9.

#23 Updated by okurz almost 3 years ago

  • Related to action #28216: [svirt-xen-hvm]test fails with 2h-timeout in "install_and_reboot" trying to restart the machine - xen backend specific restart command failed? added

#24 Updated by coolo almost 3 years ago

  • Target version deleted (Milestone 9)

#25 Updated by okurz almost 3 years ago

  • Has duplicate action #28331: [sle][functional][svirt-xen-hvm] test fails in shutdown - 2h timeout added

#26 Updated by okurz almost 3 years ago

#27 Updated by okurz almost 3 years ago

  • Subject changed from svirt: openQA won't perform another query_isotovideo check when VM is down to svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes
  • Status changed from New to In Progress
  • Priority changed from High to Urgent

Does not seem like the workaround helped: https://openqa.suse.de/tests/1293502/file/autoinst-log.txt

#28 Updated by michalnowak almost 3 years ago

David looked into this yesterday, but not sure if he reported his findings, so just FTR.

Clone a JeOS Xen job to you instance:

/usr/share/openqa/script/clone_job.pl --skip-download --from https://openqa.suse.de --host <YOU> 1315294 BACKEND=svirt VIRSH_HOSTNAME=openqaw5-xen.qa.suse.de VIRSH_PASSWORD='nots3cr3t' VIRSH_VMM_FAMILY=xen VIRSH_VMM_TYPE=hvm VIRSH_GUEST=openqaw5-xen.qa.suse.de WORKER_HOSTNAME=<YOU> QEMUCPUS=1 QEMURAM=1024 VIRSH_GUEST_PASSWORD=linux VIRSH_INSTANCE=20

Change revive_xen_domain test module's run() like this so it restarts indefinitely and the problem is triggered more easily, and let it run, should hang after ~10th iteration:

 sub run {
     my ($self) = @_;
     my $tries = 100;
     do {
    print "$tries restarts to go\n";

    power_action('reboot');
    # If we connect to 'sut' VNC display "too early" the VNC server won't be
    # ready we will be left with a blank screen.
    sleep 2 if check_var('VIRSH_VMM_FAMILY', 'vmware');
    if (check_var('VIRSH_VMM_TYPE', 'hvm')) {
        $self->wait_boot;
    }
    elsif (check_var('VIRSH_VMM_TYPE', 'linux')) {
        my $timeout = 80;
        wait_serial('Welcome to SUSE Linux', $timeout) || die "System did not boot in $timeout seconds.";
    }
    select_console 'root-console';

     }while($tries--);
     # if it didn't died in $tries tries than we're doing it wrong to replicate

 }

See: http://tragicbox.suse.cz/tests/54.

#29 Updated by coolo almost 3 years ago

if you strace the backend, it shows the typical symtoms of a dead ssh channel. Do your tests ssh into the SUT for serial grab? And if so, do you close that tunnel before shutting down?

#31 Updated by dasantiago over 2 years ago

I implemented a polling mechanism to identify when such is happening and to kill the job.

https://github.com/os-autoinst/os-autoinst/pull/902

#32 Updated by dasantiago over 2 years ago

  • % Done changed from 0 to 100

#33 Updated by dasantiago over 2 years ago

  • Status changed from In Progress to Resolved

#34 Updated by dasantiago over 2 years ago

  • Target version set to Current Sprint

#35 Updated by szarate over 2 years ago

  • Subject changed from svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes to [tools][Sprint 201711.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes

#36 Updated by szarate over 2 years ago

  • Subject changed from [tools][Sprint 201711.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes to [tools][sprint 201712.2] svirt: openQA won't perform another query_isotovideo check when VM is down -> incompletes

#37 Updated by szarate over 2 years ago

  • Target version changed from Current Sprint to Done

Also available in: Atom PDF