action #166025

closed

qemu process hanging indefinitely blocking the corresponding VNC port size:S

Added by jbaier_cz 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2024-08-30
Due date:
% Done:

0%

Estimated time:

Description

Observation

A job cancelled by a user (https://openqa.suse.de/tests/15291067) probably led to a qemu-system-x86_64 process hanging indefinitely and blocking the corresponding VNC port. The worker for that slot happily keeps accepting further jobs, which then end up incomplete with: qemu-system-x86_64: -vnc :111,share=force-shared: Failed to find an available port: Address already in use

The autoinst.log of the cancelled job shows the following, which might indicate the problem:

[2024-08-29T09:15:20.806766+02:00] [info] [pid:123148] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.
[2024-08-29T09:15:20.809153+02:00] [debug] [pid:123148] QEMU: QEMU emulator version 7.1.0 (SUSE Linux Enterprise 15)
[2024-08-29T09:15:20.809288+02:00] [debug] [pid:123148] QEMU: Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
[2024-08-29T09:15:20.810632+02:00] [debug] [pid:123148] sending magic and exit
[2024-08-29T09:15:20.811010+02:00] [warn] [pid:123148] !!! backend::baseclass::run_capture_loop: capture loop failed myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.

[2024-08-29T09:15:20.811133+02:00] [debug] [pid:123148] sending magic and exit
[2024-08-29T09:15:20.811277+02:00] [info] [pid:123148] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
[2024-08-29T09:15:20.811380+02:00] [debug] [pid:123148] sending magic and exit
Use of uninitialized value $_[2] in substr at /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi/IO/Handle.pm line 475.
        IO::Handle::write(IO::Pipe::End=GLOB(0x5561a3580470), undef, undef) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 338
        Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5561a3578f70), CODE(0x5561a344a6f8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 492
        Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5561a3578f70)) called at /usr/lib/os-autoinst/backend/driver.pm line 72
        backend::driver::start(backend::driver=HASH(0x5561a4af0658)) called at /usr/lib/os-autoinst/backend/driver.pm line 37
        backend::driver::new("backend::driver", "qemu") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 14
        OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
        OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55619c4c7bf8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
        OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55619c4c7bf8)) called at /usr/bin/isotovideo line 182
        eval {...} called at /usr/bin/isotovideo line 177
Use of uninitialized value $_[1] in substr at /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi/IO/Handle.pm line 475.
        IO::Handle::write(IO::Pipe::End=GLOB(0x5561a3580470), undef, undef) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 338
        Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x5561a3578f70), CODE(0x5561a344a6f8)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 492
        Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x5561a3578f70)) called at /usr/lib/os-autoinst/backend/driver.pm line 72
        backend::driver::start(backend::driver=HASH(0x5561a4af0658)) called at /usr/lib/os-autoinst/backend/driver.pm line 37
        backend::driver::new("backend::driver", "qemu") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 14
        OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
        OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55619c4c7bf8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
        OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55619c4c7bf8)) called at /usr/bin/isotovideo line 182
        eval {...} called at /usr/bin/isotovideo line 177

The issue was identified (https://suse.slack.com/archives/C02CANHLANP/p1724984987182769), and after killing the offending process

rfan@sapworker3:~> ps -ef|grep vnc |grep 111
_openqa+ 123419  96634 99 Aug29 ?        19:06:08 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot once=d -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :111,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -device qemu-xhci,id=xhci0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/21/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=usbstick-overlay0-file,filename=/var/lib/openqa/pool/21/raid/usbstick-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=usbstick-overlay0,file=usbstick-overlay0-file,cache.no-flush=on,discard=unmap -device usb-storage,id=usbstick-device,drive=usbstick-overlay0,bootindex=0

the worker looked OK again.

Acceptance criteria

  • AC1: The pool directory cleanup does not remove the QEMU pid file if QEMU is still running (so our detection for QEMU still running actually works).

Suggestions

  • Apparently the QEMU pid file was deleted along with the pool directory
    • Avoid deleting the pid file unless the process has actually stopped, so that our existing check for running QEMU processes works (and the worker appears broken as long as QEMU is still running); see the sketch after this list
  • Maybe QEMU got stuck because the pool directory was deleted while it was still running?
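
A minimal sketch of what that check could look like (hypothetical helper names, not the actual openQA cleanup code; it assumes the pid file is called qemu.pid inside the pool directory):

use strict;
use warnings;
use Mojo::File 'path';

# Hypothetical sketch, not the real openQA cleanup code: refuse to wipe the
# pool directory (in particular the assumed qemu.pid file) while the recorded
# QEMU process is still alive, so the existing "QEMU still running" detection
# keeps flagging the worker slot instead of letting it pick up new jobs.
sub qemu_still_running {
    my ($pool_dir) = @_;
    my $pid_file = path($pool_dir, 'qemu.pid');
    return 0 unless -e $pid_file;
    my ($pid) = $pid_file->slurp =~ /(\d+)/;
    # kill with signal 0 only probes whether the process exists, it sends nothing
    return $pid && kill(0, $pid) ? $pid : 0;
}

sub cleanup_pool_dir {
    my ($pool_dir) = @_;
    if (my $pid = qemu_still_running($pool_dir)) {
        warn "QEMU (PID $pid) still running, leaving $pool_dir untouched\n";
        return 0;
    }
    for my $entry (path($pool_dir)->list({dir => 1, hidden => 1})->each) {
        -d $entry ? $entry->remove_tree : $entry->remove;
    }
    return 1;
}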

Related issues 1 (1 open, 0 closed)

Related to openQA Project (public) - action #170209: "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M (Workable, 2024-11-25)

Actions #1

Updated by livdywan 3 months ago

  • Subject changed from qemu process hanging indefinitely blocking the corresponding VNC port to qemu process hanging indefinitely blocking the corresponding VNC port size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by mkittler 3 months ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
Actions #4

Updated by mkittler 3 months ago · Edited

Unfortunately, our code for checking whether QEMU is running already works as it should: the pool directory cleanup takes care not to delete the PID file if QEMU is still running. I extended the unit tests of that code (https://github.com/os-autoinst/openQA/pull/5912) but couldn't find any mistakes.

I also checked on sapworker3.qe.nue2.suse.org whether openqa-worker-auto-restart@21.service was running in parallel with another service, but that really doesn't seem to have been the case.

Considering it was possible to kill the process, it was probably not stuck on e.g. some uninterruptible IO call. That leaves the question why the openQA worker was not able to kill it; it is normally supposed to kill the entire process group in case any process doesn't terminate in time. I don't think it will be easy to find the exact cause, but I can dig a little in the logs and look for obvious mistakes in the code.
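
For reference, a process-group kill boils down to signalling the negated process-group ID; a rough sketch in Perl (made-up helper, not the actual worker code):

use strict;
use warnings;

# Rough sketch, not the actual openQA worker code: signalling the negated
# process-group ID reaches every process in the group, including a QEMU child
# whose parent already exited or failed to stop it.
sub kill_process_group {
    my ($pid, $timeout) = @_;
    my $pgid = getpgrp($pid);                # builtin: process group of $pid
    return 0 unless $pgid && $pgid > 0;
    kill TERM => -$pgid;                     # a negative PID addresses the whole group
    for (1 .. ($timeout // 5)) {
        return 1 unless kill 0 => -$pgid;    # signal 0 only probes for survivors
        sleep 1;
    }
    kill KILL => -$pgid;                     # escalate if SIGTERM was ignored
    return !kill(0 => -$pgid);
}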

This also still leaves the question why the detection for a still running QEMU process didn't work.

EDIT: Looks like the pid file is created by RWP, which would also clean it up again, but only if the process exited ("collect" event) or when the process cannot be started. The pid file itself is written after forking, so maybe something went wrong after forking and before the pid file was created. However, I couldn't find an obvious mistake in the code, and without logs it is hard to tell what to look for. Stopping the process is also handled by RWP. So in theory the process is killed by RWP on the os-autoinst level and on the openQA-worker level (by killing the process group). I don't know why none of this worked, considering it was possible to kill the process manually. Normally the worker and backend would also get stuck if it wasn't possible to kill the process (because RWP uses waitpid($pid, 0) after sending SIGKILL).
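
To illustrate the stop semantics described above (simplified, not the actual Mojo::IOLoop::ReadWriteProcess implementation; the pid file name is made up):

use strict;
use warnings;

# Simplified illustration only: after SIGKILL the parent blocks in waitpid(),
# so an unkillable child would leave the caller stuck rather than silently
# carrying on, and the pid file is only removed once the child has actually
# been collected.
sub stop_child {
    my ($pid, $pid_file) = @_;
    kill KILL => $pid;
    my $reaped = waitpid($pid, 0);          # blocking reap, the "collect" step
    unlink $pid_file if $reaped == $pid;    # pid file cleanup tied to collection
    return $reaped;
}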

TLDR: I don't know what went wrong here. The code in RWP is not written in a way that is easy to read, so I might have missed something when looking at it, but I also don't think it is worth spending more time on it.

Actions #5

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #6

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The PR was merged and I don't think it makes sense to dig deeper. If it happens again, it would make sense to investigate the os-autoinst and worker logs of the job that initially spawned the problematic QEMU process more closely, and to have a closer look at the state the VM/QEMU process is in.

If this happens more often, we could also add a check for the port to the worker availability/error check (if it is a QEMU slot).
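
Such a check could be as simple as trying to bind the slot's VNC port (sketch only; it assumes the usual 5900 + display-number mapping, which also applies to the -vnc :111 example above):

use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical availability check: try to bind the VNC port the slot would use;
# a failing bind means some leftover process (like the hanging QEMU above) is
# still holding the port.
sub vnc_port_free {
    my ($display_number) = @_;
    my $socket = IO::Socket::INET->new(
        LocalPort => 5900 + $display_number,
        Proto     => 'tcp',
        Listen    => 1,
    );
    return 0 unless $socket;    # bind failed, most likely "Address already in use"
    $socket->close;
    return 1;
}

# e.g. the job above used "-vnc :111", i.e. TCP port 6011
warn "VNC port for display :111 is still blocked\n" unless vnc_port_free(111);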

Actions #7

Updated by jbaier_cz 24 days ago

  • Related to action #170209: "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M added