action #170209
open"Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
0%
Description
Observation¶
I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418
It fails with:
[2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an availabale port: Address already in use
I asked in (Slack)[[https://suse.slack.com/archives/C02AJ1E568M/p1732530369296399]] and @tinita observed the same since "a few hours" - apparently all on worker39
https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445
The qemu-system-x86_64
process 1963 is running since 12 hours with -vnc :102,share=force-shared
.
Acceptance criteria¶
- AC1: Affected jobs are restarted automatically
- AC2: We have a better understanding of situations where this can happen (if at all)
Suggestion¶
- Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes
- Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess)
- Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level)
- Consider starting/stopping isotovideo in a process group with low-level Perl code to replicate the error and investigate and potentially replace the problematic Mojo::ReadWriteProcess?
Updated by tinita 16 days ago
- Priority changed from Urgent to High
% ps aux | grep qemu
...
_openqa+ 1963 98.4 0.4 7061392 2206548 ? Sl Nov24 750:16 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0,hostfwd=tcp::39102-:39102 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :102,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/12/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/12/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-vars-overlay0,unit=1
% date
Mon Nov 25 11:10:00 AM UTC 2024
The process was started 22:28, so it must be from this test:
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-201
I stopped the worker instance for now:
systemctl stop openqa-worker-auto-restart@12.service
So I think the urgency can be reduced.
Updated by jbaier_cz 16 days ago
- Related to action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S added
Updated by mkittler 14 days ago
- Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 to "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 1 day ago
There was another occurrence of this today on worker39 reported by Martin via slack. The test failed similar to the previous report:
[2024-12-10T00:49:11.761843Z] [debug] [pid:121552] starting: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 16,threads=2 -enable-kvm -no-shutdown -vnc :103,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/13/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0
[2024-12-10T00:49:11.768614Z] [debug] [pid:121552] Waiting for 0 attempts
dmesg: read kernel buffer failed: Operation not permitted
[2024-12-10T00:49:12.073274Z] [debug] [pid:121552] Waiting for 1 attempts
[2024-12-10T00:49:12.073470Z] [info] [pid:121552] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established. Check for errors below
[2024-12-10T00:49:12.073805Z] [info] [pid:121552] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-10T00:49:12.074785Z] [debug] [pid:121552] Passing remaining frames to the video encoder
[2024-12-10T00:49:12.078140Z] [debug] [pid:121552] Waiting for video encoder to finalize the video
[2024-12-10T00:49:12.078216Z] [debug] [pid:121552] The built-in video encoder (pid 121742) terminated
[2024-12-10T00:49:12.079343Z] [debug] [pid:121552] QEMU: QEMU emulator version 8.2.7 (SUSE Linux Enterprise 15)
[2024-12-10T00:49:12.079413Z] [debug] [pid:121552] QEMU: Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
[2024-12-10T00:49:12.079491Z] [warn] [pid:121552] !!! : qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2)
[2024-12-10T00:49:12.079543Z] [debug] [pid:121552] QEMU: Please configure -smp options properly or try enabling topoext feature.
[2024-12-10T00:49:12.079582Z] [warn] [pid:121552] !!! : qemu-system-x86_64: -vnc :103,share=force-shared: Failed to find an available port: Address already in use
[2024-12-10T00:49:12.080349Z] [debug] [pid:121552] sending magic and exit
[2024-12-10T00:49:12.080650Z] [debug] [pid:120871] received magic close
[2024-12-10T00:49:12.089917Z] [debug] [pid:120871] backend process exited: 0
[2024-12-10T00:49:12.190817Z] [warn] [pid:120871] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
backend::driver::start_vm(backend::driver=HASH(0x55b9f6c2f468)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/bin/isotovideo line 183
eval {...} called at /usr/bin/isotovideo line 178
[2024-12-10T00:49:12.191180Z] [debug] [pid:120871] stopping command server 121330 because test execution ended through exception
[2024-12-10T00:49:12.391590Z] [debug] [pid:120871] done with command server
[2024-12-10T00:49:12.391689Z] [debug] [pid:120871] stopping autotest process 121362
[2024-12-10T00:49:12.592074Z] [debug] [pid:120871] done with autotest process
120871: EXIT 1
[2024-12-10T00:49:12.660935Z] [info] [pid:4058] Isotovideo exit status: 1
[2024-12-10T00:49:12.693330Z] [info] [pid:4058] +++ worker notes +++
[2024-12-10T00:49:12.693534Z] [info] [pid:4058] End time: 2024-12-10 00:49:12
[2024-12-10T00:49:12.693648Z] [info] [pid:4058] Result: died
[2024-12-10T00:49:12.701384Z] [info] [pid:121845] Uploading autoinst-log.txt
There are some more details in the slack thread but I didn't follow closely and don't completely understand what the "cause" of this was.
Updated by gpathak about 18 hours ago · Edited
As @tinita suggested to find the job that actually started the leftover qemu process.
I found that on worker39:13, the job https://openqa.suse.de/tests/16150344/ was cancelled and after looking at the logs https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8085 found that QEMU didn't get signal 15 from isotovideo
.
So, this was the first job on worker39:13 which didn't close the QEMU session gracefully and caused "Address Already in Use" error on further retries.
- https://openqa.suse.de/tests/16150572#line-217
- https://openqa.suse.de/tests/16150578#line-158
- https://openqa.suse.de/tests/16150595#line-175
- https://openqa.suse.de/tests/16150618#line-151
Marius mentioned that we already have an open ticket for the scenario when QEMU sometimes doesn't exit gracefully.
Maybe this is also related: #136238
Updated by tinita about 9 hours ago
I found that both processes causing the hanging qemu process have this in common, with a tiny difference:
https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8091
Fatal glibc error: malloc.c:4189 (_int_malloc): assertion failed: chunk_main_arena (fwd)
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-5639
Fatal glibc error: malloc.c:4172 (_int_malloc): assertion failed: chunk_main_arena (bck->bk)
In both cases we have ::: backend::driver::_collect_orphan: Driver backend collected unknown process
after that, but that can't be the qemu process, because we know it was still running, causing the actual problem.