action #56993

left-over qemu processes cause incomplete jobs with "Failed to find an available port: Address already in use", previous jobs failed but did not tear down qemu

Added by okurz 5 months ago. Updated 3 months ago.

Status:ResolvedStart date:17/09/2019
Priority:HighDue date:
Assignee:mkittler% Done:

0%

Category:Concrete Bugs
Target version:Done
Difficulty:
Duration:

Description

Observation

https://openqa.suse.de/tests/3370943/file/autoinst-log.txt on openqaworker-arm-2:10 shows

[2019-09-17T17:29:22.251 UTC] [debug] QEMU: Copyright (c) 2003-2018 Fabrice Bellard and the QEMU Project developers
[2019-09-17T17:29:22.251 UTC] [debug] QEMU: qemu-system-aarch64: -vnc :100,share=force-shared: Failed to find an available port: Address already in use

checking on the worker I can find a qemu process for that worker instance still running:

_openqa+ 36620  0.0  0.0 122100 114932 ?       Ss   Sep12   4:14 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+ 46505  1.5  1.0 2045052 1332612 ?     SLl  12:45   7:00  \_ /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -sound

Looking in the worker journal I could find https://openqa.suse.de/tests/3368418/ which looks to be the job that ended up incomplete with leaving this qemu instance around.

This was observed already some time ago – since around 2019-08, potentially even earlier – by at least okurz and riafarov and mitigated by manually terminating that qemu instance. So far okurz is only aware of seeing this on openqaworker-arm-2 in particular.

Problem

First, the openQA worker is looking for a qemu pid file in the pool folder but that one is not there even though that qemu never stopped. That should be checked as well as why the qemu process is not stopped.

Workaround

ssh into the worker and terminate the faulty qemu processes manually.


Related issues

Duplicated by openQA Project - action #57188: Can't syswrite(IO::Socket::UNIX=GLOB(0xaaaae3b7e870), <BU... Rejected 23/09/2019

History

#1 Updated by coolo 5 months ago

The whole Mojo::IOLoop::ReadWriteProcess module was introduced to avoid this problem. I'm afraid it's mojo got lost during latest refactorings as we see this problem now more often again :(

#2 Updated by okurz 5 months ago

  • Duplicated by action #57188: Can't syswrite(IO::Socket::UNIX=GLOB(0xaaaae3b7e870), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 964 added

#3 Updated by okurz 5 months ago

many openQA incompletes on openqaworker4:7 because there seems to be a stuck qemu process in the backend. I will shortly debug and then restart.

openqaworker4:~ # systemctl status openqa-worker@7
● openqa-worker@7.service - openQA Worker #7
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-10-01 03:34:36 CEST; 5h 24min ago
 Main PID: 2195 (worker)
    Tasks: 7 (limit: 4915)
   CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@7.service
           ├─ 2195 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
           └─29272 /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hd>

processes:

_openqa+  2194  0.1  0.0 178828 94484 ?        Ss   03:34   0:20 /usr/bin/perl /usr/share/openqa/script/worker --instance 12
_openqa+ 32378  0.5  0.0 311180 196404 ?       S    07:32   0:28  \_ perl /usr/bin/isotovideo -d
_openqa+ 32381  0.0  0.0 292180 205036 ?       S    07:32   0:02      \_ /usr/bin/isotovideo: comman
_openqa+ 32390  0.4  0.0 1055832 248764 ?      Sl   07:32   0:21      \_ /usr/bin/isotovideo: autote
_openqa+ 32430 10.2  0.7 4880796 1996596 ?     Sl   07:32   8:44      \_ /usr/bin/isotovideo: backen
_openqa+ 32446  3.3  0.0 806832 63864 ?        SNl  07:32   2:52          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/12/video.ogv
_openqa+ 18077 41.0  0.8 7028160 2197676 ?     Sl   08:46   4:54          \_ /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=ser
_openqa+  2195  0.0  0.0 157600 73172 ?        Ss   03:34   0:11 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+ 29272  0.2  0.6 4177780 1642604 ?     Sl   04:35   0:38  \_ /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0
_openqa+  2196  0.0  0.0 167988 83252 ?        Ss   03:34   0:15 /usr/bin/perl /usr/share/openqa/script/worker --instance 11
_openqa+ 15837  0.2  0.4 5435960 1117444 ?     Sl   06:19   0:25  \_ /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0
_openqa+  2198  0.0  0.0 177720 93284 ?        Ss   03:34   0:15 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  2199  0.1  0.0 180524 95928 ?        Ss   03:34   0:20 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+  2200  0.0  0.0 179344 94996 ?        Ss   03:34   0:19 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+  2201  0.0  0.0 165648 80848 ?        Ss   03:34   0:11 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
_openqa+  2202  0.1  0.0 187712 103104 ?       Ss   03:34   0:21 /usr/bin/perl /usr/share/openqa/script/worker --instance 14
_openqa+  2203  0.1  0.0 173140 88500 ?        Ss   03:34   0:20 /usr/bin/perl /usr/share/openqa/script/worker --instance 15
_openqa+  2204  0.0  0.0 181244 97144 ?        Ss   03:34   0:18 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+  2205  0.0  0.0 172404 87820 ?        Ss   03:34   0:14 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+  2207  0.0  0.0 174652 89756 ?        Ss   03:34   0:17 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+  2208  0.1  0.0 186612 102012 ?       Ss   03:34   0:22 /usr/bin/perl /usr/share/openqa/script/worker --instance 16
_openqa+ 32382  0.6  0.0 359188 243200 ?       S    07:32   0:33  \_ perl /usr/bin/isotovideo -d
_openqa+ 32408  0.0  0.0 294676 207912 ?       S    07:32   0:02      \_ /usr/bin/isotovideo: comman
_openqa+ 32410  0.5  0.1 1136896 329840 ?      Sl   07:32   0:25      \_ /usr/bin/isotovideo: autote
_openqa+ 32482 11.2  0.7 4871584 1987560 ?     Sl   07:32   9:38      \_ /usr/bin/isotovideo: backen
_openqa+ 32498  3.5  0.0 807200 63992 ?        SNl  07:32   3:01          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/16/video.ogv
_openqa+ 17425 34.2  0.4 6262656 1192180 ?     Sl   08:41   5:56          \_ /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=ser
_openqa+  2209  0.0  0.0 182480 97728 ?        Ss   03:34   0:18 /usr/bin/perl /usr/share/openqa/script/worker --instance 13
_openqa+  2210  0.0  0.0 175964 91648 ?        Ss   03:34   0:18 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+  2211  0.1  0.0 181656 96996 ?        Ss   03:34   0:20 /usr/bin/perl /usr/share/openqa/script/worker --instance 9

from https://openqa.opensuse.org/admin/workers/170 the first faulty job is https://openqa.opensuse.org/tests/1044996

it seems as if instance 11 suffers from the same:

# ls /var/lib/openqa/pool/11/
.locked  qemu.pid

but the corresponding worker reports as "Broken": https://openqa.opensuse.org/admin/workers/163

Stopped both services and restarted. Retriggered jobs with env instance=7 openqa-restart-incompletes-on-worker-instance

This looks related:

Visit https://build.opensuse.org/package/live_build_log/devel:openQA/os-autoinst/openSUSE_Leap_42.3/ppc64le

Last lines of build log:
[  422s] PASS: 01-test_needle.t
…
[  838s] PASS: 17-basetest.t
[24017s] INFO: rcu_sched self-detected stall on CPU
[24017s]    2-...: (1 GPs behind) idle=0f7/140000000000001/0 softirq=53642/53642 fqs=0 
[24017s]     (t=6841 jiffies g=34116 c=34115 q=868)
[24017s] rcu_sched kthread starved for 6841 jiffies! g34116 c34115 f0x0 s3 ->state=0x1
[52824s] qemu-system-ppc64: terminating on signal 15 from pid 80178 (<unknown process>)


Job seems to be stuck here, killed. (after 28800 seconds of inactivity)
[52824s] ### VM INTERACTION END ###
[52824s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[52824s] or the build host has a kernel or hardware problem...

so a qemu process is persisting even though it should have been terminated.

#4 Updated by oorlov 5 months ago

Just in case if additional info may be required. The issue also reproduced on the recent build on aarch64 (openqaworker-arm-1:11) on several test runs: https://openqa.suse.de/tests/3450715, https://openqa.suse.de/tests/3450716, https://openqa.suse.de/tests/3450721, https://openqa.suse.de/tests/3450722

#5 Updated by coolo 5 months ago

  • Target version set to Ready

We really had this under control, this regressed :(

#6 Updated by mkittler 5 months ago

Even though Mojo::IOLoop::ReadWriteProcess should have been taken care QEMU the worker actually has some logic which stops it from taking further jobs when QEMU is still running. Apparently this is broken since the last refactoring. On the other hand ...

the corresponding worker reports as "Broken"

So the check within the worker at least works sometimes. Note that the worker reports itself as broken is expected (despite the fact that Mojo::IOLoop::ReadWriteProcess should have been able to kill QEMU in the first place). Before the worker refactoring the worker behaved similarly. It did not report itself as broken but prevented running further jobs by simply cancelling itself.

#7 Updated by mkittler 5 months ago

Maybe the worker reports itself as broken but still takes new jobs. I'll try to check that.

Reproducing the Mojo::IOLoop::ReadWriteProcess part will be harder. I'm quite often running tests locally and abort them in the middle (e.g. when running the full-stack test) and that doesn't leave QEMU instances running (usually just chromedriver instances). In the failing tests we also see

[2019-09-17T12:45:06.0471 UTC] [warn] [pid:36620] Failed creating CGroup subtree 'Invalid path at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess/CGroup.pm line 35.
', disabling them.You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS

Maybe that's why Mojo::IOLoop::ReadWriteProcess "lost its mojo"? I see (and ignore) this message actually all the time when running tests locally.

According to the worker history the last successful job before the job mentioned in the ticket was https://openqa.suse.de/tests/3368352. It finished normally. The next job failed for unknown reason (no autoinst-log.txt uploaded, https://openqa.suse.de/tests/3368418). The next job already failed in the same way as the job mentioned in the ticket description (https://openqa.suse.de/tests/3368443).

#8 Updated by mkittler 5 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

We have good tests for not cleaning up the PID file in the pool directory as well as for not taking jobs while the worker is considered broken. But I forgot to ensure that the broken state is actually updated after and before starting a test. That's why it isn't always working. (The broken state is only recomputed in certain intervals but only while idling. So a busy worker would so far never really check whether QEMU is still running until it runs out of jobs.)

#10 Updated by mkittler 5 months ago

PR to fix cgroup creation with Mojo::IOLoop::ReadWriteProcess module: https://github.com/os-autoinst/openQA/pull/2389

With this fix it at least creates the cgroup and passes it to the Mojo::IOLoop::ReadWriteProcess::Container thing. But I've just checked whether this has any actual influence on the cgroup tree (systemd-cgls) but it seems isotovideo and its forks including QEMU are just within the usual session-2.scope. So not a big win.

However, I really doubt that the cgroup creation/usage via Mojo::IOLoop::ReadWriteProcess has ever been working correctly. The log message about the cgroup creation has been there since like forever.

So the real regression is that the worker still took jobs while the QEMU process from the last job was still running. That should be fixed with the PR mentioned in the previous comment.

Previously (= before the worker refactoring) it likely worked like this: QEMU can not be killed immediately -> worker terminates -> systemd might try to restart it but gives up after several attempts -> QEMU terminates after all -> the worker is restarted at some point anyways

Now (= after https://github.com/os-autoinst/openQA/pull/2385 has been merged) the worker should declare itself as broken and re-evaluate that state in certain intervals.

#11 Updated by mkittler 4 months ago

The PRs have been merged and like I already said in the previous comment we're back to the state before the regression. That doesn't mean that the whole cgroup handling within Mojo::IOLoop::ReadWriteProcess works correctly. Not sure whether I should invest more time in fixing it (upstream).

#12 Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

#13 Updated by okurz 4 months ago

There is a potential upstream fix by kraih in https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/pull/5 , included in release 0.24 , let's wait for the next cycle of autoupdate of the package in devel:languages:perl . Then we should also check that the workaround to delete not working tests from the package in https://build.opensuse.org/package/view_file/devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess/cpanspec.yml?expand=1 are removed again if tests work. After the necessary submission to openSUSE:Factory we can provide a link into https://build.opensuse.org/project/show/devel:openQA:Leap:15.1 to use in production – if it helps.

#14 Updated by mkittler 4 months ago

We haven't seen the problem anymore, right? So we're at least back to where we were before the worker refactoring which means at least the worker wouldn't take any further jobs as long as QEMU is still running. Not sure about Mojo::IOLoop::ReadWriteProcess.

#15 Updated by okurz 4 months ago

mkittler wrote:

We haven't seen the problem anymore, right?

Well, have we even looked? ;) I suggest we grep the logfiles on o3 and osd for the occurence.

[…] Not sure about Mojo::IOLoop::ReadWriteProcess.

On openqaworker4:

# zypper se --verbose perl-Mojo-IOLoop-ReadWriteProcess
Retrieving repository 'Providing openQA dependencies (openSUSE_Leap_15.1)' metadata ------------------------[\]
Subprocess failed. Error: Failed to import public key [5B121667CBDF5E8F-5d724b68] [devel:openQA OBS Project <devel:openQA@build.opensuse.org>] [expires: 2021-11-14]
History:
 - Command exited with status 1.
 - error: /var/tmp/zypp.sFuwe6/pubkey-5B121667CBDF5E8F-XgLMUv: key 1 import failed.
 - error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Read-only file system)

Retrieving repository 'Providing openQA dependencies (openSUSE_Leap_15.1)' metadata .....................[done]
Building repository 'Providing openQA dependencies (openSUSE_Leap_15.1)' cache ..........................[done]
Loading repository data...
Reading installed packages...

S  | Name                              | Type    | Version        | Arch   | Repository          
---+-----------------------------------+---------+----------------+--------+---------------------
i+ | perl-Mojo-IOLoop-ReadWriteProcess | package | 0.23-lp151.1.1 | noarch | openSUSE-Leap-15.1-1

so it seems we do not even have the patched version of perl-Mojo-IOLoop-ReadWriteProcess available nor installed. I did osc linkpac openSUSE:Factory perl-Mojo-IOLoop-ReadWriteProcess devel:openQA:Leap:15.1 to provide it. We should make sure this package will be installed on all workers – after it's built and published.

#16 Updated by mkittler 4 months ago

Like I mentioned in the chat I could recently reproduce the issue locally. While the worker handles it now well (the state is set to broken and also set to online again after manually killing QEMU) I still had a leftover QEMU running on my system. And at this time I also had the new version of perl-Mojo-IOLoop-ReadWriteProcess already installed:

rpm -qa --queryformat '%{name}; %{version}; %{installtime:date}\n' perl-Mojo-IOLoop-ReadWriteProcess
perl-Mojo-IOLoop-ReadWriteProcess; 0.24; Do 24 Okt 2019 14:56:46 CEST

(Why doesn't zypper info show the install date of a package? Even pacman can show the install and the build date of a package.)

#17 Updated by okurz 3 months ago

If you already have the new version of the dependencies I guess you need to set to "In Progress" and work on it or back to "Workable", at best after writing down what are the next steps to undertake

#18 Updated by mkittler 3 months ago

  • Status changed from Feedback to Workable

We should decide whether it is worth the effort to fix perl-Mojo-IOLoop-ReadWriteProcess.

#19 Updated by okurz 3 months ago

IMHO we should do whatever is easiest and lowest effort for us to fix the os-autoinst issue. If the easiest is to fix perl-Mojo-IOLoop-ReadWriteProcess then this, otherwise we just fix it downstream.

#20 Updated by mkittler 3 months ago

  • Status changed from Workable to Resolved
  • Target version changed from Current Sprint to Done

I would even mark this as resolved since we shouldn't run into the problem anymore because the worker catches this. I think that's the best we can do from the downstream side without duplicating things perl-Mojo-IOLoop-ReadWriteProcess is actually supposed to do. I'll add a new ticket for fixing remaining issues within perl-Mojo-IOLoop-ReadWriteProcess.

Follow-up ticket: https://progress.opensuse.org/issues/59852

Also available in: Atom PDF