action #170209


"Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M

Added by nicksinger 16 days ago. Updated about 9 hours ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2024-11-25
Due date:
% Done:

0%

Estimated time:

Description

Observation

I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418
It fails with:

[2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an available port: Address already in use

I asked in Slack (https://suse.slack.com/archives/C02AJ1E568M/p1732530369296399) and @tinita had observed the same for "a few hours" - apparently all on worker39.

https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445
The qemu-system-x86_64 process 1963 has been running for 12 hours with -vnc :102,share=force-shared.

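(For context: QEMU's VNC display :N listens on TCP port 5900 + N, so -vnc :102 corresponds to port 6002. As a minimal illustration - a hypothetical Perl check, not part of os-autoinst - one could test whether a leftover process still holds that port:)

use strict;
use warnings;
use IO::Socket::INET;

# QEMU's VNC display :N corresponds to TCP port 5900 + N.
my $display = 102;                 # from "-vnc :102,share=force-shared"
my $port    = 5900 + $display;

# Binding the port fails with "Address already in use" if a leftover qemu still holds it.
if (my $sock = IO::Socket::INET->new(LocalPort => $port, Proto => 'tcp', Listen => 1)) {
    print "port $port is free\n";
    close $sock;
}
else {
    warn "port $port is already in use: $@\n";    # IO::Socket::INET reports the error in $@
}
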
Acceptance criteria

  • AC1: Affected jobs are restarted automatically
  • AC2: We have a better understanding of situations where this can happen (if at all)

Suggestion

  • Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes
  • Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess)
  • Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level)
  • Consider starting/stopping isotovideo in a process group with low-level Perl code to replicate the error, investigate it, and potentially replace the problematic Mojo::ReadWriteProcess? (see the sketch after this list)

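A rough sketch of what the last suggestion could look like in plain Perl (hypothetical, with placeholder timeouts and without the arguments isotovideo actually needs), just to show what running isotovideo in its own process group and signalling the whole group could mean:

use strict;
use warnings;
use POSIX ':sys_wait_h';

# Start isotovideo as the leader of its own process group so that qemu and any
# other children it spawns can be signalled together.
my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    setpgrp(0, 0);                          # child becomes its own process group leader
    exec 'isotovideo' or die "exec failed: $!";    # the real invocation takes more arguments
}

# Later, when the job is stopped or cancelled:
kill TERM => -$pid;                         # a negative PID signals the whole group
sleep 10;                                   # grace period, value chosen arbitrarily here
kill KILL => -$pid if kill 0 => -$pid;      # escalate if anything in the group survived
1 while waitpid(-1, WNOHANG) > 0;           # reap whatever exited
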
Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S (Resolved, mkittler, 2024-08-30)

Actions #1

Updated by tinita 16 days ago

  • Description updated (diff)
Actions #2

Updated by okurz 16 days ago

  • Tags set to reactive work, osd
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by tinita 16 days ago

  • Priority changed from Urgent to High

% ps aux | grep qemu
...
_openqa+   1963 98.4  0.4 7061392 2206548 ?     Sl   Nov24 750:16 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0,hostfwd=tcp::39102-:39102 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :102,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/12/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/12/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-vars-overlay0,unit=1
% date
Mon Nov 25 11:10:00 AM UTC 2024

The process was started at 22:28, so it must be from this test:
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-201

I stopped the worker instance for now:

systemctl stop openqa-worker-auto-restart@12.service

So I think the urgency can be reduced.
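
As a side note, the manual ps lookup above could also be scripted; a small, hypothetical Perl helper (not existing tooling) to list qemu processes together with the VNC display and pool directory they use might look like this:

use strict;
use warnings;

# List running qemu processes with their VNC display and pool directory,
# to spot which worker instance is holding a port.
for my $cmdline_file (glob '/proc/[0-9]*/cmdline') {
    my ($pid) = $cmdline_file =~ m{/proc/(\d+)/};
    open my $fh, '<', $cmdline_file or next;
    my @args = split /\0/, do { local $/; <$fh> // '' };    # arguments are NUL-separated
    next unless @args && $args[0] =~ /qemu-system/;
    my ($vnc)  = map { $args[$_ + 1] } grep { $args[$_] eq '-vnc' } 0 .. $#args - 1;
    my ($pool) = "@args" =~ m{(/var/lib/openqa/pool/\d+)};
    printf "pid %d: vnc %s, pool %s\n", $pid, $vnc // '-', $pool // '-';
}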

Actions #4

Updated by tinita 16 days ago

Stopping the instance apparently killed the process, so I started the instance again.

Actions #5

Updated by jbaier_cz 16 days ago

  • Related to action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S added
Actions #6

Updated by tinita 16 days ago

I restarted all affected jobs.

Actions #7

Updated by mkittler 14 days ago

  • Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 to "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by okurz 8 days ago

  • Priority changed from High to Normal
Actions #9

Updated by nicksinger 1 day ago

There was another occurrence of this today on worker39, reported by Martin via Slack. The test failed similarly to the one in the previous report:

[2024-12-10T00:49:11.761843Z] [debug] [pid:121552] starting: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 16,threads=2 -enable-kvm -no-shutdown -vnc :103,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/13/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0
[2024-12-10T00:49:11.768614Z] [debug] [pid:121552] Waiting for 0 attempts
dmesg: read kernel buffer failed: Operation not permitted
[2024-12-10T00:49:12.073274Z] [debug] [pid:121552] Waiting for 1 attempts
[2024-12-10T00:49:12.073470Z] [info] [pid:121552] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2024-12-10T00:49:12.073805Z] [info] [pid:121552] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-10T00:49:12.074785Z] [debug] [pid:121552] Passing remaining frames to the video encoder
[2024-12-10T00:49:12.078140Z] [debug] [pid:121552] Waiting for video encoder to finalize the video
[2024-12-10T00:49:12.078216Z] [debug] [pid:121552] The built-in video encoder (pid 121742) terminated
[2024-12-10T00:49:12.079343Z] [debug] [pid:121552] QEMU: QEMU emulator version 8.2.7 (SUSE Linux Enterprise 15)
[2024-12-10T00:49:12.079413Z] [debug] [pid:121552] QEMU: Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
[2024-12-10T00:49:12.079491Z] [warn] [pid:121552] !!! : qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2)
[2024-12-10T00:49:12.079543Z] [debug] [pid:121552] QEMU: Please configure -smp options properly or try enabling topoext feature.
[2024-12-10T00:49:12.079582Z] [warn] [pid:121552] !!! : qemu-system-x86_64: -vnc :103,share=force-shared: Failed to find an available port: Address already in use
[2024-12-10T00:49:12.080349Z] [debug] [pid:121552] sending magic and exit
[2024-12-10T00:49:12.080650Z] [debug] [pid:120871] received magic close
[2024-12-10T00:49:12.089917Z] [debug] [pid:120871] backend process exited: 0
[2024-12-10T00:49:12.190817Z] [warn] [pid:120871] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
    backend::driver::start_vm(backend::driver=HASH(0x55b9f6c2f468)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
    OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
    OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
    OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/bin/isotovideo line 183
    eval {...} called at /usr/bin/isotovideo line 178

[2024-12-10T00:49:12.191180Z] [debug] [pid:120871] stopping command server 121330 because test execution ended through exception
[2024-12-10T00:49:12.391590Z] [debug] [pid:120871] done with command server
[2024-12-10T00:49:12.391689Z] [debug] [pid:120871] stopping autotest process 121362
[2024-12-10T00:49:12.592074Z] [debug] [pid:120871] done with autotest process
120871: EXIT 1
[2024-12-10T00:49:12.660935Z] [info] [pid:4058] Isotovideo exit status: 1
[2024-12-10T00:49:12.693330Z] [info] [pid:4058] +++ worker notes +++
[2024-12-10T00:49:12.693534Z] [info] [pid:4058] End time: 2024-12-10 00:49:12
[2024-12-10T00:49:12.693648Z] [info] [pid:4058] Result: died
[2024-12-10T00:49:12.701384Z] [info] [pid:121845] Uploading autoinst-log.txt

There are some more details in the Slack thread, but I didn't follow it closely and don't completely understand what the "cause" of this was.

Actions #10

Updated by gpathak about 18 hours ago · Edited

As @tinita suggested, I tried to find the job that actually started the leftover qemu process.
I found that on worker39:13 the job https://openqa.suse.de/tests/16150344/ was cancelled, and the logs (https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8085) show that QEMU didn't receive signal 15 from isotovideo.
So this was the first job on worker39:13 that didn't close the QEMU session gracefully, causing the "Address already in use" error on subsequent retries.

Marius mentioned that we already have an open ticket for the scenario where QEMU sometimes doesn't exit gracefully.
Maybe this is also related: #136238
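
Purely as an illustration of the graceful shutdown that apparently did not happen here (a hypothetical helper, not the actual worker/os-autoinst code): send signal 15 first and only escalate to SIGKILL after a timeout, e.g.

use strict;
use warnings;
use POSIX ':sys_wait_h';
use Time::HiRes 'sleep';

# Ask a child process (e.g. qemu) to terminate gracefully, escalate only if needed.
sub stop_child {
    my ($pid, $timeout) = @_;
    kill TERM => $pid or return;                     # signal 15, the graceful path
    my $elapsed = 0;
    while ($elapsed < $timeout) {
        return if waitpid($pid, WNOHANG) == $pid;    # child exited and was reaped
        sleep 0.2;
        $elapsed += 0.2;
    }
    kill KILL => $pid;                               # child never acted on SIGTERM, force it
    waitpid $pid, 0;
}

# e.g. stop_child($qemu_pid, 30);   # $qemu_pid is a placeholder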

Actions #11

Updated by tinita about 9 hours ago

I found that both jobs that left behind a hanging qemu process have this in common, with a tiny difference:
https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8091

Fatal glibc error: malloc.c:4189 (_int_malloc): assertion failed: chunk_main_arena (fwd)

https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-5639

Fatal glibc error: malloc.c:4172 (_int_malloc): assertion failed: chunk_main_arena (bck->bk)    

In both cases this is followed by ::: backend::driver::_collect_orphan: Driver backend collected unknown process, but that can't be the qemu process, because we know it was still running and causing the actual problem.
