action #68938: Try to reduce waiting time in case of qemu (early-)exits auto_review:"QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 443":retry - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #68938

closed

Try to reduce waiting time in case of qemu (early-)exits auto_review:"QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 443":retry

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2020-07-14

Due date:

% Done:

Estimated time:

Description

Motivation¶

https://github.com/os-autoinst/os-autoinst/pull/1456 especially https://github.com/os-autoinst/os-autoinst/pull/1456/commits/dd04130e665df2cb908c6712728da7879cb79d8d and https://github.com/os-autoinst/os-autoinst/pull/1456/commits/20d11554af1cb70af99dfee70a8d35b42fba04ef showed us again that the handling of the qemu process is problematic regarding detecting early-exits or maybe exits in general. We waste time within a loop potentially even in cases when the qemu process has stopped already. We should try to reduce the waiting time in case of qemu (early-)exits.

Acceptance criteria¶

AC1: There is no sleeping in case the qemu process is already stopped

Suggestions¶

Look into https://github.com/os-autoinst/os-autoinst/blob/master/OpenQA/Qemu/Proc.pm , especially https://github.com/os-autoinst/os-autoinst/blob/master/OpenQA/Qemu/Proc.pm#L421 if we look for the qemu process to be active. Maybe we just sleep and should do waitpid(…, W_NOHANG) to detect early exits
Use t/18-qemu-options.t (or maybe one of t/17-backend-qemu.t or t/18-qemu.t or t/14-isotovideo.t) to check if the problem actually exists or extend tests accordingly

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by okurz over 4 years ago

Description updated (diff)
Difficulty set to medium

Actions

Copy link

Updated by mkittler over 4 years ago

Assignee set to mkittler

Actions

Copy link

Updated by mkittler over 4 years ago

Status changed from Workable to In Progress

PR: https://github.com/os-autoinst/os-autoinst/pull/1520

Actions

Copy link

Updated by livdywan over 4 years ago

Status changed from In Progress to Feedback

Actions

Copy link

Updated by okurz over 4 years ago

Status changed from Feedback to Resolved

Executing prove -v -I. t/18-qemu-options.t locally shows that the runtime is 28s before and after the change, so not worse at least :)

I can see the message

[2020-09-01T15:19:23.954 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established at /home/okurz/local/os-autoinst/os-autoinst/t/../OpenQA/Qemu/Proc.pm line 443.

so it seems there is really no sleeping after qemu has stopped but still the shutdown time of the complete stack is long (again?). I will report a specific ticket about that.

Actions

Copy link

Updated by dzedro over 4 years ago

This should improve error handling, but on this wicked test it does terminate the job and I can't see the reason.

https://openqa.suse.de/tests/4629798

[37m[2020-09-02T11:00:51.990 CEST] [debug] ||| finished t01_bonding_balance_rr wicked/aggregate/sut at 2020-09-02 09:00:51 (42 s)
[0m[37m[2020-09-02T11:00:51.991 CEST] [debug] Loading a VM snapshot lastgood
[0m[37m[2020-09-02T11:00:51.992 CEST] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[1]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":217538,"seconds":1599037212}}
[0m[37m[2020-09-02T11:00:51.992 CEST] [debug] Loading snapshot (Current VM state is running).
[0m[37m[2020-09-02T11:00:52.011 CEST] [debug] EVENT {"event":"STOP","timestamp":{"microseconds":11601,"seconds":1599037252}}
[0m[33m[2020-09-02T11:00:52.013 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[0m[37m[2020-09-02T11:00:54.044 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 19158 (/usr/bin/isotovideo: backen)
[0m[37m[2020-09-02T11:00:54.048 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/9/raid/hd0-overlay1 /var/lib/openqa/pool/9/raid/hd0-overlay2 42949672960
[0m[37m[2020-09-02T11:00:54.071 CEST] [debug] Formatting '/var/lib/openqa/pool/9/raid/hd0-overlay2', fmt=qcow2 size=42949672960 backing_file=/var/lib/openqa/pool/9/raid/hd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[0m[37m[2020-09-02T11:00:54.071 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/9/raid/cd0-overlay1 /var/lib/openqa/pool/9/raid/cd0-overlay2 10619977728
[0m[37m[2020-09-02T11:00:54.093 CEST] [debug] Formatting '/var/lib/openqa/pool/9/raid/cd0-overlay2', fmt=qcow2 size=10619977728 backing_file=/var/lib/openqa/pool/9/raid/cd0-overlay1 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[0m[37m[2020-09-02T11:00:54.094 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -global isa-fdc.driveA= -m 2048 -cpu qemu64 -netdev tap,id=qanet0,ifname=tap8,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:02:81 -netdev tap,id=qanet1,ifname=tap72,script=no,downscript=no -device virtio-net,netdev=qanet1,mac=52:54:00:12:42:81 -boot order=c -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :99,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console1,path=virtio_console1,logfile=virtio_console1.log,logappend=on -device virtconsole,chardev=virtio_console1,name=org.openqa.console.virtio_console1 -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay2-file,filename=/var/lib/openqa/pool/9/raid/hd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay2,file=hd0-overlay2-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay2,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay2-file,filename=/var/lib/openqa/pool/9/raid/cd0-overlay2,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay2,file=cd0-overlay2-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay2,serial=cd0 -incoming defer
[0m[37m[2020-09-02T11:00:54.101 CEST] [debug] Waiting for 0 attempts
[0m[37m[2020-09-02T11:00:54.102 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 443.

Actions

Copy link

Updated by okurz over 4 years ago

Status changed from Resolved to Workable
Priority changed from Normal to Urgent

users on o3 also report some problems with incompletes since yesterday, i.e. since the PR was merged. could https://openqa.opensuse.org/tests/1381351 be related to https://github.com/os-autoinst/os-autoinst/pull/1520 ?

Actions

Copy link

Updated by okurz over 4 years ago

Subject changed from Try to reduce waiting time in case of qemu (early-)exits to Try to reduce waiting time in case of qemu (early-)exits auto_review:"QEMU terminated before QMP connection could be established at /usr/lib/os-autoinst/OpenQA/Qemu/Proc.pm line 443":retry
Status changed from Workable to Feedback

fixed by https://github.com/os-autoinst/os-autoinst/pull/1523 , was manually deployed on all o3 workers and on osd with a retried osd deployment in https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/77459 . we can retrigger all according incomplete jobs now.

Actions

Copy link